




DUKE UNIVERSITY 
DURHAM, N. C. 





DUKE UNIVERSITY RESEARCH STUDIES IN EDUCATION, NUMBER 2


VARIABILITY IN RESULTS FROM 
NEW-TYPE ACHIEVEMENT 
TESTS 


By EARL V. PULLIAS 
Duke University 





DUKE UNIVERSITY PRESS 
Durham, N. C. 
1937 









VARIABILITY IN RESULTS FROM 
NEW-TYPE ACHIEVEMENT 
TESTS 


By EARL V. PULLIAS 


Duke University 
WITH A FOREWORD BY 


WILLIAM A. BROWNELL 


Duke University 


DUKE UNIVERSITY PRESS 
Durham, N. C. 
1937





FOREWORD 


In this monograph, the second of the Duke University Research 
Studies in Education, Dr. Pullias presents important new evidence 
on the variability of so-called objective measures of school achieve- 
ment. The new-type tests used, both teacher-made and standard, were 
given under the normal conditions of classroom measurement, and in 
all probability similar results would be obtained in any school em- 
ploying comparable tests. 

In this monograph the author reports his experimental procedures 
and his findings at some length—perhaps at greater length than is 
needed by the reader who is familiar with this type of investigation. 
Such a reader will be able to find what he wants in the tables and 
can safely neglect many parts given over to exposition. But Dr. 
Pullias is writing chiefly for a different audience. He has in mind 
school officials and teachers who are using new-type tests in ways 
which may or may not be justified. Such readers require more than 
a bare outline of what was done and what was found. Accordingly, 
Dr. Pullias, careful in his experimental work to use only simple 
techniques which can be duplicated in any school situation, has, in 
this report, been equally careful to provide enough explanation and 
interpretation to guarantee real understanding. 

Dr. Pullias holds no brief for any particular kind of test. He is 
concerned only with finding the degree to which different new-type 
tests, designed to measure the same educational products, agree or dis- 
agree in the measures they yield. As a matter of fact, the large 
amount of disagreement found calls into question a number of educa- 
tional practices now based upon results from such tests. 

In his last chapter Dr. Pullias discusses briefly the factors making
for variability in new-type tests. Here the student of educational
measurement will find much to challenge his thought. He will welcome
too the several testable hypotheses proposed for further research.
Some of these hypotheses Dr. Pullias is himself investigating.


WILLIAM A. BROWNELL




ACKNOWLEDGMENTS 


In reporting this investigation I wish to express my apprecia- 
tion to those who have assisted me in the course of its preparation. 
The criticisms and suggestions of members of the faculty of the 
Department of Education of Duke University have been very helpful. 
I am especially indebted to the following persons for assistance in 
connection with this study: Dr. William A. Brownell, Professor of 
Educational Psychology, Duke University; Dr. Howard Easley, 
Assistant Professor of Educational Psychology, Duke University;
and Dr. Douglas E. Scates, Director of the Bureau of School Re- 
search, Cincinnati Public Schools. 

I desire to express my appreciation to Mr. W. F. Warren,
Superintendent of Durham City Schools, and to Mr. L. H. Barbour,
Superintendent of Durham County Schools, for their permitting me to
carry out the study in the city and county schools and for their
warm interest in the research. I am grateful to Mrs. J. A. Robinson,
Primary Supervisor, Durham City Schools, for her assistance with
the administration of the standardized tests, and for her suggestions
concerning other aspects of the study. I wish to thank the principals
of the schools in which the study was made for their interested
co-operation. I am especially indebted to each of the thirty-five
teachers, who constructed and administered informal new-type tests,
for their efficient and friendly co-operation, without which the
teacher-made part of the investigation would have been impossible.

Funds for carrying out some parts of the research here reported 
were granted by the Duke Research Council. I am deeply grateful 
for this assistance. 


E. V. P.


TABLE OF CONTENTS


LIST OF TABLES AND FIGURES

PART I. INTRODUCTION

CHAPTER
I. THE PROBLEM AND THE PROPOSED TREATMENT
II. A BRIEF HISTORY OF THE PROBLEM

PART II. THE INVESTIGATION

A. Informal or Teacher-Made Tests

III. METHODS USED IN THE STUDY OF TEACHER-MADE [TESTS]
IV. CORRELATION ANALYSIS OF RESULTS FROM PAIRED INFORMAL OR
TEACHER-MADE OBJECTIVE TESTS
V. PERCENTILE POINT ANALYSIS OF RESULTS FROM PAIRED INFORMAL
OBJECTIVE TESTS
VI. TEACHER-MARK ANALYSIS OF INFORMAL OBJECTIVE [TESTS]

B. Commercial Standardized Tests

VII. DESCRIPTION OF PROCEDURES USED IN STUDY OF STANDARDIZED TESTS
VIII. CORRELATION ANALYSIS OF STANDARDIZED COMMERCIAL TEST RESULTS
IX. ANALYSIS OF STANDARDIZED TEST FINDINGS IN TERMS OF
GRADE-EQUIVALENT DISPARITY
X. SUMMARY AND CONCLUSIONS

PART III. PRACTICAL IMPLICATIONS AND THEORETICAL PROBLEMS

XI. EDUCATIONAL IMPLICATIONS AND PROBLEMS FOR FURTHER RESEARCH

APPENDIX
BIBLIOGRAPHY




LIST OF TABLES AND FIGURES


TABLE

*I. [Illegible] Scores on Test [illegible] and Test [illegible]
II. Number and Type of Items in Each of Sixty-three Teacher-Made
Objective Tests
III. Summary of Facts Which Pertain to Important Characteristics
of Teacher-Made Objective Tests
IV. Teachers, Tests, Pupils, and Test Papers Involved in the Study
of Teacher-Made Objective Tests Which Cover Shorter Units
[remainder illegible]
V. Teachers, Tests, Pupils, and Test Papers Involved in Study of
Teacher-Made Objective Tests Which Cover a Semester of Work
VI. Sample Set of Data Illustrating Types of Analyses Used in the
Study of Teacher-Made Objective Tests
VII. Fifth Grade Geography—Correlation between Informal Objective
Tests Which Cover Identical Text Material
VIII. Sixth Grade Geography—Correlation between Informal Objective
Tests Which Cover Identical Text Material
IX. Fifth Grade History—Correlation between Informal Objective
Tests Which Cover Identical Text Material
X. Sixth Grade History—Correlation between Informal Objective
Tests Which Cover Identical Text Material
XI. High School History—Correlation between Informal Objective
Tests Which Cover Identical Text Material
XII. Summary of Correlations between Tests Which Cover Identical
Text Material
XIII. Seventh Grade Geography—Correlation between Informal Objective
Tests Which Cover a Semester of Work
XIV. Seventh Grade History—Correlation between Informal Objective
Tests Which Cover a Semester of Work
XV. Mean Correlation for All Paired Tests in Geography
XVI. Mean Correlation for All Paired Tests in History
XVII. Summary of Correlations between Paired Tests for All Subjects
[remainder illegible]
XVIII. Comparison of Distributions of Correlations for Traditional
Tests and for Objective or New-Type Tests
XIX. Fifth Grade Geography—Disparity in Terms of Percentile Points
between Informal Objective Tests Which Cover Identical Text Material
XX. Summary Distribution—Percentile Point Disparity for All Tests
Which Cover Identical Text Material
XXI. Summary of Means—Percentile Point Disparity for All Tests
Which Cover Identical Text Material
XXII. Summary Distribution—Percentile Point Disparity for All Tests
Which Cover a Semester of Work
XXIII. Summary of Means—Percentile Point Disparity for All Tests
Which Cover a Semester of Work
XXIV. All Geography Tests—Summary of Mean Percentile Point Disparity
XXV. All History Tests—Summary of Mean Percentile Point Disparity
XXVI. All Tests—Summary Distribution of Percentile Point Disparity
XXVII. All Tests—Summary of Means of Percentile Point Disparity
XXVIII. Fifth Grade Geography—Disparity in Terms of Teachers’ Marks
between Informal Objective Tests Which Cover Identical Text Material
XXIX. Summary Distribution—Teacher-Mark Disparity for All Tests
Which Cover Identical Text Material
XXX. Summary of Means—Teacher-Mark Disparity for All Tests
Which Cover Identical Text Material
XXXI. Summary Distribution—Teacher-Mark Disparity for All Tests
Which Cover a Semester of Work
XXXII. Summary of Means—Teacher-Mark Disparity for All Tests
Which Cover a Semester of Work
XXXIII. All Geography Tests—Summary Distribution of Teacher-Mark
Disparity
XXXIV. All History Tests—Summary Distribution of Teacher-Mark
Disparity
XXXV. All Tests—Summary Distribution of Teacher-Mark Disparity
XXXVI. All Tests—Summary of Means of Teacher-Mark Disparity
XXXVII. Sub-Test Composition of Metropolitan, New Stanford, and
Public School Achievement Tests
XXXVIII. Method of Rotation Used in Administering Standardized Tests
XXXIX. Correlation Table—Reading Tests a and b
XL. Correlations between Standardized Achievement Tests Which
Were Designed to Measure the Same Abilities
XLI. Correlations from Foran’s and Loyes’ Study of Three
Standardized Tests
XLII. Mean Correlations for Standardized Tests in Various Subjects
Based upon Data from Foran and Loyes and upon Data from
[the Present Study]
XLIII. Mean Correlations between American History Tests
(Standardized) Based upon Data from Ruch and Others
XLIV. Arithmetic Computation—Grade-Equivalent Disparity between
Tests a, b, and c in Terms of Months
*XLV. Arithmetic Reasoning—Grade-Equivalent Disparity between
Tests a, b, and c in Terms of Months
*XLVI. Geography—Grade-Equivalent Disparity between Tests a, b,
and c in Terms of Months
*XLVII. Health—Grade-Equivalent Disparity between Tests b and c
in Terms of Months
*XLVIII. History—Grade-Equivalent Disparity between Tests a, b, and c
in Terms of Months
*XLIX. Language Usage—Grade-Equivalent Disparity between Tests
[letters illegible] in Terms of Months
*L. Literature—Grade-Equivalent Disparity between Tests a and c
in Terms of Months
*LI. Reading—Paragraph-Meaning—Grade-Equivalent Disparity between
Tests a, b, and c in Terms of Months
*LII. Spelling—Grade-Equivalent Disparity between Tests a, b, and c
in Terms of Months
*LIII. Word Meaning—Disparity between Tests a and c in Terms
of Months
LIV. Summary of Mean Grade-Equivalent Disparity for All Subject
Tests

FIGURE

1. Percentile Curve for Identical Unit Tests
2. Percentile Curve for Semester Tests

*The table appears in the Appendix.





PART I. INTRODUCTION 


CHAPTER I


THE PROBLEM AND THE PROPOSED TREATMENT 


STATEMENT OF THE PROBLEM


This investigation was made to determine the extent to which
so-called objective tests produce equivalent results. The method was
that of measuring the disparity in the results of comparable objective
tests administered under comparable conditions. The findings of the
investigation offer a tentative answer to the following question:
When two so-called objective tests designed to measure the same
thing are used to measure the achievement of the same pupils, to
what extent will the results secured by the two measuring
instruments vary?

Detailed analysis. The problem may be broken down into more
specific queries. 1. If two competent teachers construct objective tests 
designed to measure the acquaintance of a group of pupils with a 
given section of text material, and if both of these tests are given 
to the same group of pupils, how will the results of the measure- 
ment secured by the application of one test compare with the results 
when the other test is used? 

a. What will be the correlation between the results of the two 
sets of measurements? When a large number of comparisons are 
made, what will be the average degree of relationship in terms of 
a coefficient of correlation?

b. To what extent will the results from the two measurements 
diverge in terms of percentile points? As an illustration, what per 
cent of the group will change percentile position as much as twenty 
percentile points? 

c. How much disparity will there be between the results of the 
two sets of measurements in terms of school marks? For example, 
how many children will make a mark of B on one test and a mark 
of A on the other test? 

2. If two or more standardized tests designed to measure cer- 
tain pupil abilities are administered to the same children under com- 
parable conditions how nearly identical will the results of the meas- 
urement be? 




a. What will be the relationship between the two sets of meas- 
ures in terms of a coefficient of correlation? What will be the aver- 
age correlation when a number of subject tests are considered? 

b. To what extent will the results of one test vary from the re- 
sults of another test in terms of grade placement? For example, in 
the case of how many pupils will the score on one test vary from 
the score on another test to the extent of making a difference of 
ten months in grade equivalence? What will be the average amount 
of disparity in terms of school months for given subject tests? 

3. What are the educational implications of the facts revealed in 
answer to the queries listed under 1 and 2 in the foregoing para- 
graphs ? 

4. What are some of the problems for further research which 
are suggested by the findings of this study? 
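
The measures named in queries 1 and 2 can be illustrated with a short computational sketch. It is offered only as an illustration under stated assumptions and is not the procedure of this study: the function names, the mark cutoffs, the tie-handling rule for percentile ranks, the ten-month school year behind the grade equivalents, and all scores are hypothetical.

```python
from statistics import mean


def pearson_r(x, y):
    """Product-moment coefficient of correlation between two score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def percentile_ranks(scores):
    """Percent of the group below each score, counting half of any ties."""
    n = len(scores)
    return [
        100.0 * (sum(t < s for t in scores) + 0.5 * sum(t == s for t in scores)) / n
        for s in scores
    ]


def percent_shifting(scores_a, scores_b, points=20):
    """Percent of pupils whose percentile position differs by `points` or more."""
    pa, pb = percentile_ranks(scores_a), percentile_ranks(scores_b)
    moved = sum(abs(a - b) >= points for a, b in zip(pa, pb))
    return 100.0 * moved / len(pa)


# Hypothetical cutoffs (percent correct) for converting scores to marks.
MARK_CUTOFFS = [(90, "A"), (80, "B"), (70, "C"), (60, "D")]


def letter_mark(percent_correct):
    """Convert a percent-correct score to a letter mark using MARK_CUTOFFS."""
    for cutoff, mark in MARK_CUTOFFS:
        if percent_correct >= cutoff:
            return mark
    return "F"


def mark_steps(mark_a, mark_b):
    """Number of letter steps separating two marks (A to B is one step)."""
    order = "ABCDF"
    return abs(order.index(mark_a) - order.index(mark_b))


def months_apart(grade_equiv_a, grade_equiv_b):
    """Disparity in school months between two grade equivalents such as 5.3,
    assuming a ten-month school year (5.3 = fifth grade, third month)."""
    return abs(round(grade_equiv_a * 10) - round(grade_equiv_b * 10))


if __name__ == "__main__":
    # Invented percent-correct scores of eight pupils on two paired tests.
    test_1 = [95, 88, 72, 64, 81, 55, 90, 77]
    test_2 = [84, 91, 60, 70, 79, 66, 73, 85]
    print("r =", round(pearson_r(test_1, test_2), 2))
    print("percent shifting 20 or more points:", percent_shifting(test_1, test_2))
    steps = [mark_steps(letter_mark(a), letter_mark(b)) for a, b in zip(test_1, test_2)]
    print("pupils whose mark changes:", sum(s > 0 for s in steps))
    print("months apart, 5.3 vs 6.4:", months_apart(5.3, 6.4))
```

Each function corresponds to one query: `pearson_r` to queries 1a and 2a, `percent_shifting` to 1b, `letter_mark` with `mark_steps` to 1c, and `months_apart` to 2b. Averaging the coefficients over many pairs of tests would give the mean correlation asked for in 1a.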

The problem further analyzed. The objective test received its
name from the fact that such tests are relatively objective in
regard to scoring. Odell defines such tests in the following manner:

The term objective is commonly applied to a test which contains
exercises of such a sort that there is no disagreement among competent
scorers as to what the correct answers are.1


That is to say, objective tests are free from personal judgment only
in respect to scoring. A part of this phase of objectivity is secured 
because those persons using a test agree to use a given key, as in 
the case of standardized tests. 

But scoring is only one aspect of the total testing situation. In
fact, it is a relatively simple aspect of the complex problems involved 
in the measurement of psychological phenomena. When the sub- 
jectivity in scoring is eliminated, do other factors which produce 
marked variation in test results remain in the testing situation as 
a whole? If so, what is the extent of the variation? To get a mean- 
ingful answer to this question (that is, if one would inquire into 
the basic elements of objectivity in measurement) one must press 
his inquiry beyond the relatively superficial matter of uniformity 
in scoring. The measurement of the results of education involves 
at least three types of fundamental problem, all of which are closely 
related to objectivity in the total measurement situation. First, there 
are the problems relating to the selection of test materials, the 
formulation of these materials into test items, and the organization 
of these items into a test. Second, certain important problems grow 


1 C. W. Odell, Educational Measurement in High School (New York:
The Century Company, 1930), p. 70. 




out of the psychological nature of the thing measured, the highly 
organized, functioning, dynamic mind of the pupil. Third, impor- 
tant issues relating to the interpretation of the results of measure- 
ment are involved. Each of these major problems comprises a large 
number of more specific issues.2

This study was not designed as an attack upon these basic causes 
of variation in social or psychological measurement. On the con- 
trary, the investigation was designed to furnish at least a partial 
answer to a question which seemed to be a necessary preliminary 
step to a meaningful attack upon the more fundamental prob- 
lem; namely, when tests which are objective in regard to scoring 
are used as the instrument for testing, to what extent do the test 
results show variation? 


INVESTIGATIONS RELATING TO OBJECTIVE TESTS 


A complete bibliography of the studies that have been made in 
an attempt to illuminate various phases of the measurement prob- 
lem would constitute a volume of considerable proportions. If the 
field is narrowed to researches relating to the new-type or objective 
test, the number of titles continues to be large. Fortunately, ade- 
quate bibliographies of the research literature pertaining to the ob- 
jective test are available. The studies pertaining to this particular 
type of educational measurement which were reported before 1929 
were competently reviewed by Ruch.3 Later researches have been
very conveniently summarized by Lee and Symonds.4 Bibliographies
covering the general field of educational measurement are also
available.5

There have been numerous researches6 which relate to the
reliability of teachers’ marks, particularly marks based upon the essay
type test, and to the reliability of test scores.

2 See chap. xi for an extended list of these.

3 G. M. Ruch, The Objective or New-Type Examination (New York: Scott,
Foresman and Company, 1929).

4 J. Murray Lee and P. M. Symonds, “New-Type or Objective Tests: A
Summary of Recent Investigations,” The Journal of Educational Psychology,
XXIV (Jan., 1933), 21-38; XXV (March, 1934), 185-191.

5 C. W. Odell, A Selected Annotated Bibliography Dealing with
Examinations and School Marks (Bulletin No. 43; Urbana: Bureau of
Educational Research, University of Illinois, 1929); H. L. Smith and
Wendell William Wright, Second Revision of the Bibliography of
Educational Measurements (Bulletin of the School of Education, Vol. IV,
No. 2; Bloomington: Bureau of Co-operative Research, Indiana University,
1927); “Bibliography on Educational Tests and Their Use,” Review of
Educational Research, III (Feb., 1933), 62-80; Vernon Jones and Robert H.
Brown, “Educational Tests,” Psychological Bulletin, XXXII (July, 1934),
473-499.

6 See Ruch, op. cit., chaps. iii and xi, for a review of the better known studies.




The typical procedures used in the former type of research— 
that proposing to determine the reliability of teachers’ marks—are, 
in general, as follows: First, the same teacher scores identical
papers at varying intervals;7 second, a number of teachers mark or
evaluate copies of a single pupil’s paper;8 third, a study is made
of the distribution of marks given by teachers of different
subjects;9 fourth, marks given to identical children in the same
subject during consecutive years are compared.10 The degree of
variation or unreliability is judged by the variation in marks reported.
One should note that the first and second procedures involve only
the scoring phase of the testing situation;11 the third method of
approach is even less related to the measurement situation as a
whole, for it confuses the issue by introducing comparisons between
the results of measurement in different subjects; and the fourth
type of procedure fails to control properly a number of significant
factors, such as important psychological changes which result from
a change from one school to another in the case of the pupils
acting as subjects. Without doubt many or all of these investigations
have contributed something to progress in educational measurement,
but they have discovered neither the basic sources of subjectivity
in such measurements, nor the degree or extent of variation or
disparity when the method of research used was such as to
take these fundamental factors into account.


The second type of research on this problem has centered around 
the more strictly statistical meaning of reliability. Ruch has very 
aptly described this method of approach: 


When two sets of measures of the same ability or function are
correlated, we term the resulting coefficient of correlation a reliability
coefficient. By “two sets of measures of the same ability or function” we
have in mind equivalent or comparable forms of the same test, or some
closely analogous pair of measures.


If a teacher gives two forms of a standard test or if she administers 
two duplicate examinations, the two sets of scores may be compared by 


7 Daniel Starch, Educational Measurements (New York: The Macmillan 
Company, 1918), p. 9. 

8 Daniel Starch and E. C. Elliott, “The Reliability of Grading High School 
Work in Mathematics,” School Review, XXI (April, 1913), 254-259. 

9 F. W. Johnson, “A Study of High School Grades,” School Review, XIX
(Jan., 1911), 13-24.

10 F. J. Kelly, Teachers’ Marks, Their Variability and Standardization
(Teachers College Contributions to Education, No. 66; New York: Teachers
College, Columbia University, 1914).

11 See pp. 14 and 15 for analysis of this term.




correlation, the resulting coefficient in this case being a reliability
coefficient.

There are several ways of obtaining reliability coefficients when we 
are studying examinations: 

1. Two equivalent (or roughly equivalent) tests may be given and the 
results correlated. 

2. A single test may be given, the papers graded independently by two 
teachers, and the two sets of marks are then correlated. 

3. A single examination, graded by a single person, may be broken 
into two half-examinations by some chance method (e. g., taking alter- 
nate items in each half-form), and the halves are then correlated... . 

These three methods are not exactly comparable in meaning, but 
each has its distinct uses in attacking the question of the reliability of ex- 
aminations.12 
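
The first and third of the methods Ruch lists can be sketched briefly in code. This is a hedged illustration, not Ruch's or the present study's computation: the function names and the layout of the item data (one row per pupil, one entry of 1 or 0 per item) are assumptions, and the scores are invented. The second method would use the same correlation function on two teachers' marks for the same papers.

```python
from statistics import mean


def pearson_r(x, y):
    """Product-moment coefficient of correlation between two score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def split_half_r(item_scores):
    """Method 3: correlate half-scores built from alternate items.

    item_scores holds one row per pupil and one 1/0 entry per item
    (a hypothetical layout chosen for this sketch).
    """
    first_half = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    second_half = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    return pearson_r(first_half, second_half)


if __name__ == "__main__":
    # Method 1: two roughly equivalent tests given to the same five pupils.
    form_a = [31, 27, 40, 22, 35]
    form_b = [29, 30, 38, 25, 31]
    print("paired-test coefficient:", round(pearson_r(form_a, form_b), 2))

    # Method 3: a single six-item test split into alternate-item halves.
    items = [
        [1, 1, 1, 0, 1, 1],
        [1, 0, 0, 0, 1, 0],
        [1, 1, 1, 1, 1, 1],
        [0, 0, 1, 0, 0, 0],
        [1, 1, 0, 1, 1, 0],
    ]
    print("split-half coefficient:", round(split_half_r(items), 2))
```

The sketch reports the correlation between the two halves as it stands and applies no adjustment for the shortened length of each half.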


Some of these studies13 involved a technique very nearly identical
to that used in the portion of this study relating to teacher-made
tests, except that these researches were based upon traditional type 
tests. The coefficients of reliability reported in such studies permit 
interesting comparisons with the results from the corresponding 
phase of this study (Chapter IV). 

The results from a large number of researches14 which report
reliability coefficients secured by correlating the results from a sec- 
ond administration of the same test, by the administration of com- 
parable forms of a given test, and by correlating two halves of the 
same test are not directly comparable to the data presented in this 
study. 


ORGANIZATION OF REPORT 


The present report has been organized into three major divisions
designated as Parts I, II, and III.15 Part I is introductory in
nature. The purpose of this division is to orient the reader in
relation to the problem. Part II is a report of the detailed procedures
followed, the data secured, the interpretation of these data, and
a summary of the conclusions which the data appear to warrant.


12 Ruch, op. cit., pp. 89-90.

13 W. S. Monroe and L. B. Souders, Present Status of Written Examinations
and Suggestions for Their Improvement (Bulletin No. 17; Urbana: Bureau of
Educational Research, University of Illinois, 1923).

14 See Ruch, op. cit., chap. xi, for a review of typical examples.

15 This report includes with certain modifications research presented to the
Graduate School of Arts and Sciences of Duke University as a doctor’s
dissertation under the title, Disparity in Results from New-Type or Objective
Tests Constructed to Measure the Same Abilities. The dissertation was done
under the direction of Professor William A. Brownell.




Part III is theoretical in nature. The purpose of this division is, 
in the first place, to present a theoretical discussion of the implica- 
tions of the findings of this research for educational measurements, 
and in the second place, to set forth some hypotheses as to the sources 
of disparity in educational measurements. 

Chapter II is presented in order to make the problem here in- 
vestigated more meaningful. Omission of this chapter would not 
seriously disturb the continuity of the report of the findings. 




CHAPTER II 


A BRIEF HISTORY OF THE PROBLEM1


As a result of the close relationship between education and
measurement, formal education has been from its inception accompanied
by a correspondingly formal type of measurement designed to
determine with some degree of immediacy and accuracy the extent
to which the purposes of the educative process have been realized.
A brief review2 of some of the prominent examples of this attempt
at formal measurement will serve to orient the reader to the
research which is the basis of this report.

Review of background factors. Probably all of the highly civil- 
ized ancient peoples developed some type of educational measure- 
ment. The Chinese boy was required to repeat the classic treatises 
from memory. The Greek youth was subjected to trying tests in 
logic, oratory, and related activities. The exact nature of the meas- 
ures which were used to determine the results of formal, academic 
training is not always clear. However, the evidence seems to war- 
rant the assumption that varying forms of examinations, oral and 
written, were used. Ruch comments as follows on this point: 

Oral quizzing, Socratic or otherwise, had been from time immemorial
a part of the daily classroom routine; in fact, at times it was all of
teaching. Formal written examinations are probably more recent than oral
testing, but these date their origins many centuries ago; certainly formal
written examinations were firmly intrenched in the educational system
of China thirteen hundred years ago, and were familiar to Grecian and
Roman teachers.3


In America probably the most general method of educational 
measurement prior to the middle of the nineteenth century was the 
oral type, for in 1845 Horace Mann4 made an extended argument


1 Chapter II, which deals with the historical background of the problem,
may be omitted if the reader feels no need for a discussion of background
factors.

2 No extended account of the techniques of measurement or of the
application of these procedures to education can be undertaken here. Texts
dealing with the history of science should be consulted for the more general
development. A serviceable account of the history of the use of measurements
in education may be found in C. W. Odell, Educational Measurement in High
School (New York: The Century Company, 1930), chap. ii.

3 G. M. Ruch, op. cit., p. 3.

4 For a valuable detailed analysis of Mann’s contribution, ibid., pp. 4 ff.




for the use of written examinations in the public schools of Massa- 
chusetts. 

An interesting occurrence in the modern testing movement, but 
one which had relatively little influence, is the work of Reverend 
George Fisher,5 an English schoolmaster. Odell makes the following
statement concerning the nature and significance of Fisher’s
work:

About 1864 he constructed a “scale book” which contained samples of 
typical questions and of various degrees of proficiency in answering the 
questions in several school subjects. The questions were intended to be 
models for the construction of future examinations similar in nature and 
difficulty. It is not apparent, however, that this work of Fisher’s attracted 
much attention at the time, although it contained the germ of a number 
of principles later employed.6


The work of J. M. Rice seems to be the first direct step in the 
development of present-day educational measurement. Rice began 
his researches in 1894 and published his results in a series of articles 
in the Forum.7 He gave tests in a number of subjects to pupils in
different cities, but the portion of this work which is best known is 
that pertaining to spelling. In this phase of his research Rice gave 
an identical spelling test to pupils who had been taught spelling in 
periods of varying lengths and compared the results. This pro- 
cedure embodied many of the basic elements of later work on stand- 
ardized tests. 

Beginnings of the modern movement. The events just described 
are of interest from the point of view of history and may have had 
considerable influence upon the so-called scientific movement in 
the measurement of educational products. But the fact seems to be, 
to speak conservatively, that the precipitating cause of the rapid
development of this movement was Thorndike’s pioneer book8
published in 1904. This book, due to the author’s subsequent influence,
definitely committed education to the methods of measurement which 
had proved so fruitful in the physical sciences. From the publica- 
tion of this volume the history of measurement in education is an 
account of the attempts which have been made to realize the ideals 
set forth therein. In this publication Thorndike presents the sub- 


5 E. B. Chadwick, “Statistics of Educational Results,” The Museum: A
Quarterly Magazine of Educational Literature and Science, 11 (Jan., 1864),
479-484.

6 Odell, op. cit., p. 31.

7 For specific references to these articles, ibid., pp. 31-32.

8 Edward L. Thorndike, An Introduction to the Theory of Mental and Social
Measurements (revised ed., 1913; New York: Teachers College, Columbia
University, 1904).




stance of his famous creed, which, in certain respects at least, is 
basic to the subsequent work in the construction of tests and scales. 
The position expressed in the statement that whatever exists at all 
exists in some amount and therefore can be measured9 has been
widely accepted in the literature of educational measurements. The 
following quotations are typical of the manner in which Thorndike 
further elaborated this creed: 


The scales in actual use in psychology, education, sociology, history 
and the like are often inadequate in respect to one or more of the essen- 
tials of a scale. The work of the student of mental and social measure- 
ments is, then, to replace them by better ones so far as he can, to devise 
methods to make the most out of those which he does not replace, and 
to avoid attributing to a measurement properties which the scale by which 
it was obtained does not justify. The last two tasks need no further 
mention at this point. Concerning the first, it has already been suggested 
that in cases where quantitative study of human nature and achievement is 
balked at the very beginning by the lack of series of defined amounts, 
whose differences from each other and from defined zero points are 
known, this lack is due rather to lack of study than to any essential insus- 
ceptibility of human behavior to rating in units of amount on intelligible 
scales.10 

The problem for a quantitative study of the mental sciences is thus 
to devise means of measuring things, differences, changes and relation- 
ships for which standard units of amount are often not at hand: which are 
variable, and so unexpressible in any case by a single figure; and which 
are so complex that, to represent any one of them, a long statement in 
terms of different sorts of quantities is commonly needed. This last diffi- 
culty of mental measurements is not, however, one which demands any 
form of statistical procedure essentially different from that used in science 
in general.11


In reviewing the development of objective tests in education Monroe
makes the following statement about Thorndike’s early book:


In 1904 Thorndike published the first edition of his Mental and Social 
Measurements. In addition to an account of statistical procedure, this 
volume contained many of the principles upon which the construction 
of our present tests is based. It was revised in 1913, but the revision 
consists, primarily, of adding concrete illustrations of the principles. 


9 Edward L. Thorndike, “The Nature, Purposes, and General Methods of
Measurements of Educational Products,” The Seventeenth Yearbook of the
National Society for the Study of Education, Part II (Bloomington, Ill.:
Public School Publishing Company, 1918), p. 16; Walter Scott Monroe, An
Introduction to the Theory of Educational Measurements (New York:
Houghton Mifflin Company, 1923), chap. i; Charles Russell, Standard Tests
(Boston: Ginn and Company, 1930), chap. iii; and William A. McCall, How to
Measure in Education (New York: Macmillan Company, 1922), chap. i.

10 Thorndike, Mental and Social Measurements, p. 18.

11 Ibid., p. 6.




This book is yet an important source of information for workers in this 
field, and for a number of years was essentially the only source.12


The work of Rice shocked professional educators into confusion. 
His bold exploits into realms hitherto sacred and the disturbing re- 
sults which he secured threw those educators who were susceptible 
to the influence of new facts into a state of doubt. If such quan- 
titative results could be obtained in measuring some aspects of edu- 
cation, did it not follow then that all educational products might be 
evaluated in quantitative terms? The work of Rice constituted a 
significant practical step, but practical innovations rarely inaugurate 
movements. Movements have their beginnings in theory—in state- 
ments of creed. Such was Thorndike’s Mental and Social Measure- 
ments. This book threw the educational measurements machine into 
gear, as it were, and it has gathered momentum, in terms of quan- 
tity of effort, for the past decade—a momentum the practical re- 
sults of which have not always been conducive to real progress.13 
Supplied with a theory, workers in education began to produce tests 
and scales designed to realize the demands of that theory. Odell 
gives the following description of the early developments: 

It was not until 1908 that anyone followed up Rice’s work by publish- 
ing a standardized test or scale in any school subject. In this year Stone, 
a student of Thorndike’s, issued his arithmetic reasoning test. This is 
generally considered the first standardized subject-matter or achievement 
test. For the next few years standardized tests and scales appeared at 
the rate of about one a year, practically all of them being by Thorndike 
and his students. As was probably to be expected, these early tests and 
scales were for use in subjects entirely or primarily taught in elementary 
rather than high school. The following list includes those which had
appeared by 1913, also one noteworthy but somewhat later one: Courtis:
Arithmetic Tests, Series A (1909); Thorndike Scale for Handwriting of
Children (1909); Hillegas Scale for the Measurement of Quality in
English Composition by Young People (1912); Buckingham Spelling
Scale (1913); Thorndike Scale for General Merit of Children’s Drawings
(1913); Ayres Scale for Measuring the Quality of Handwriting
of School Children (1912); Ayres Measuring Scale for Ability in Spelling
(1915).14


These factors (the work of Rice, the theoretical statement of 
Thorndike, and the production of tests and scales) constituted the 
more important elements of the scientific movement in education. 
The general purpose of this movement was to make of education a 

12 Monroe, op. cit., p. 5.

13 The reference here is to the numerous hurriedly constructed tests and
scales and to claims beyond substantiating facts.

14 Odell, op. cit., pp. 34-35.




quantitative science—that is, a discipline that sought answers to its 
problems by an appeal to objective fact. Since educationists consid- 
ered appeal to fact as a method of solving problems to be possible 
only when measuring instruments capable of revealing these facts 
are available, much of the energy of the so-called scientific move- 
ment in education has been expended in an attempt to produce such 
instruments. In short, the proposal was to make of education an 
objective science, the chief requisite of such a consummation being 
the development of objective measurements. This term objective 
when used to characterize a basic and essential factor in the method 
of approach which would lead to a science of education is identical 
in meaning with the term as it is used in the physical sciences.15 

Attack upon traditional type tests. Educationists having agreed 
upon a method of approach and the goal of a science of education 
were confronted with the necessity of determining whether or not 
the methods of measurement which were in use at that time fulfilled 
the requirements of objective measuring instruments. With this 
problem as an issue, numerous investigators studied the type of tests 
then in use, namely, the essay or traditional type. This attack resulted
in the production of much evidence16 in support of the thesis
that these instruments of measurement were essentially subjective in 
their nature and, therefore, unreliable as scientific measuring de- 
vices. 

Development of objective tests. If the types of measurement in 
use were inadequate to the needs of a science of education, what then 
must be done? Obviously, the answer to this query was that more 
objective measures had to be invented—measures which could 
be used in the social sciences in a manner comparable to the uses of 
measurements such as those of length, weight, and time in the physi- 
cal sciences. When the problem had been thus clearly defined, re- 
search workers set themselves earnestly to the task of producing 
measuring instruments according to the specifications of the older 
sciences.17 

15 Thorndike, op. cit., p. 11.

16 The specific investigations that produced this evidence have been reviewed
too often and too well to demand another presentation here. A good summary
may be found in Ruch, op. cit., chap. iii.

17 The complex and mammoth nature of such an undertaking hardly seems
to have occurred, at least consciously, to these workers. It might have been
conducive to a more sober development of educational measurements had these
facts been taken into account: first, that the social sciences are young; second,
that the measuring instruments in use in the physical sciences are the products
of a very long and slow development; and third, that probably such slow
development could not be profitably short-circuited. For an excellent exposition




Under the pressure of this demand for objective measurement 
in education, an event occurred which has had a significant influence 
upon the testing movement. There was developed a type of test 
item18 which was called “objective.” Apparently intrigued by this
term, educationists interested in the measurement of achievement 
have expended during the past fifteen years a great portion of their 
energy in the construction, analysis, and praise of these so-called 
objective tests. The old or traditional-type test became discredited 
because of its unreliability or subjectivity; the new-type or objective
test seemed to fulfill the requirements of science and therefore came 
to be used very widely and in some cases uncritically. The move- 
ment took two directions: first, the development of commercial 
tests, and second, the use of teacher-made objective tests. The extent
and nature19 of these phases of the commonly called measurement
movement in education and the enthusiasm with which they
have been prosecuted are matters of common knowledge. 

Cause for the popularity of “objective” tests. The cause basic 
to the rapid production and use of this new-type measurement is sig- 
nificant. A frequent cause for the spread of a method of procedure 
is the acceptance on the part of those adopting such procedure of 
assumptions which tend to prejudice them in favor of the method 
in question. Was there such an assumption in this case? Thorn- 
dike and others had proclaimed in very strong terms that objective, 
scientific measuring devices are basic to a science. A challenge had 
been made to this youthful discipline, education, that it show the 
qualities requisite to an entrance into the group of established sci- 
ences. One of these qualities, according to Thorndike and others, 
was an ability to measure objectively phenomena related to its prob- 
lems in the sense in which the term had been used in relation to the 
physical sciences.20 The answer to this demand was the production


of this point see Wolfgang Kohler, Gestalt Psychology (New York: Horace 
Liveright, Inc., 1929), chap. ii. 

18 The origin of this type of item, as such, does not seem to be clear. Odell
(op. cit., p. 41) makes the following comment on this point: “At the beginning
of 1920 appeared an article by McCall, which seems to have been the first
published discussion along this line.” The general features of this type of test
item very likely were patterned after the items used in tests of general
intelligence.

19 See Odell, op. cit., chap. ii; Ruch, op. cit., chap. iv. Almost all of the
numerous texts in the field of measurements supply facts bearing on the point.

20 A. D. Ritchie, Scientific Method (London: Kegan Paul, 1923), chap. v;
Herbert Dingle, Science and Human Experience (New York: Macmillan
Company, 1932), chap. vii; Daniel Sommer Robinson, The Principles of Reasoning,
An Introduction to Logic and Scientific Method (New York: D. Appleton and




and wide use of measuring instruments designated as objective. It 
seems clear then that an assumption basic to much that has been done 
and said in relation to measurement in education is that tests which 
are relatively objective in regard to scoring21 are therefore objective
in the broader sense. A knowledge of the extent of disparity in re- 
sults from new-type or objective tests should give some indication 
as to the soundness of this assumption. 

Company, 1924), Pt. II; Kohler, op. cit., chap. ii.

21 This is considered an accurate characterization of the “objective” test.
See Charles W. Odell, A Glossary of Three Hundred Terms Used in Educational
Measurement and Research (Bulletin No. 40; Urbana: Bureau of Educational
Research, College of Education, University of Illinois, 1928), p. 43.


PART II. THE INVESTIGATION


(a.) INFORMAL OR TEACHER-MADE TESTS 


CHAPTER III 


METHODS USED IN THE STUDY OF TEACHER-MADE
OBJECTIVE TESTS


The problem. The problem of this part of the investigation was 
to determine the extent to which scores from two objective teacher- 
made tests tend to be identical when these tests are used to measure 
the achievement of a group of children with reference to a designated 
body of subject matter. The problem may be stated as a question: 
If two competent teachers independently of each other prepare ob- 
jective tests which they consider accurate measures of pupil knowl- 
edge of the same twenty pages in a geography text, and if they give 
these two tests to the same pupils, how will the results from the two 
tests compare? 

Facts and conditions requisite to a solution of the problem. In 
order to work toward a solution of the problem, scores from a num- 
ber of comparable objective tests had to be obtained. It was neces- 
sary that competent teachers agree to construct objective tests cov- 
ering identical or nearly identical subject matter and that these tests 
be given in pairs to groups of children. The decision was made to 
secure in a real school situation the data which a solution of the prob- 
lem demanded. The findings, for this reason, must be interpreted as 
applying to objective tests when such tests are used by competent 
teachers in connection with their regular activities in certain subjects. 

The school systems. The school systems of the city and county 
of Durham, North Carolina, were selected for an attack upon the 
problem. The investigation was made during the school year 1934-35. 
The school systems in which the study was prosecuted are rated 
among the most efficient systems of the state. The school popula- 
tions of the city and county are typical of school populations in gen- 
eral, for there are schools which serve industrial communities, rural 
sections, and small towns. 

No part of the findings of this study should be in any sense con- 
sidered as a reflection upon the schools in which the data were 




gathered. These data should be interpreted as facts which were 
secured in school systems that are fairly representative of the bet- 
ter public schools of the area in which they are located. 

Training and teaching experience of teachers who co-operated. 
The problem as stated was to compare the results from tests made 
by competent teachers. This is a point of some significance, for if 
the teachers who constructed the objective tests used in the study 
were not so well qualified as the average teacher to construct such 
tests, the validity of any generalization drawn from the data might 
be questioned. 

A summary of the facts pertaining to the academic training and 
teaching experience of the teachers who constructed the tests used 
in this study is presented here: 


Number of teachers with no degree ............................ 9
Number of teachers with Bachelor’s degree .................... 26
Number of teachers with Master’s degree ...................... 6
Number of teachers with Doctor’s degree ...................... 1
Median years of experience ................................... 13


The facts should be noted that twenty-six, or 74.3 per cent, of the 
thirty-five teachers who constructed the tests had completed suffi- 
cient academic study to receive the Bachelor’s degree, and that seven, 
or 20 per cent, had received one or more advanced degrees. No 
teacher had done less than two years of college work; further, all 
the teachers who did not have degrees, as well as those who held 
degrees, had done summer school work at regular intervals during 
their teaching careers. The median years of experience, thirteen 
years, seems to indicate that the teachers had ample experience. These 
facts warrant the conclusion that the teachers who constructed 
the tests used as a basis for this research were competent public 
school teachers, if one may judge competency by training and ex- 
perience. 
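
The percentages quoted in the foregoing paragraph follow directly from the summary figures. The short Python sketch below reproduces the arithmetic. It rests on one assumption, namely that the seven teachers with advanced degrees are counted among the twenty-six who held the Bachelor’s degree, so that the nine teachers without degrees bring the total to thirty-five. The variable names are illustrative only and are not part of the study.

```python
# Figures taken from the summary of teacher training.
no_degree = 9
bachelors_or_higher = 26   # assumed to include holders of advanced degrees
masters = 6
doctors = 1

total_teachers = no_degree + bachelors_or_higher
advanced = masters + doctors


def per_cent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


pct_bachelors = per_cent(bachelors_or_higher, total_teachers)
pct_advanced = per_cent(advanced, total_teachers)

print(total_teachers, pct_bachelors, pct_advanced)
```

On this reading the total and both percentages agree with the figures given in the text.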

Attitude of teachers. Although the facts presented in the fore- 
going paragraph show that the teachers who assisted with this in- 
vestigation had adequate training and experience, these things alone 
would not have qualified them to fulfill the purposes of the study. 
The validity of the findings in such cases depends upon the gen- 
uineness of the teachers’ efforts. 

Those individuals who assisted with this investigation manifested 
a lively interest in its progress. This interest resulted from several 
causes, or conditions. First, the teachers felt a need for improving 
their testing procedures. Second, a personal conference in which 




the nature and the purpose of the work were carefully explained 
was held with each teacher. Third, the marking, scoring, and inter- 
preting of the tests were a part of the regular school work. Fourth, 
any teacher who for any reason expressed a hesitancy about assist- 
ing with the investigation was immediately excused from all obliga- 
tion with respect to the study. Conscious effort was made to avoid 
any form of coercion (administrative or otherwise) that might have 
caused uninterested teachers to co-operate. 


GENERAL DESCRIPTION OF PROCEDURE 


Example of the technique. The example given here is taken 
from that part of the investigation carried out in fifth grade history. 
Teacher V and Teacher I were asked to agree upon a block of text 
material which would be covered in five weeks of classroom work. 
The teachers taught this part of the text as they had planned to 
teach it. Near the end of the teaching period the teachers, inde- 
pendently, constructed objective tests which they considered ade- 
quate to measure the pupils’ acquaintance with the text material. 
Both of these tests were mimeographed exactly as the teachers directed;
and both tests were given, in this case, to Teacher V’s pupils.
The two tests were then scored and the scores put into percentage
figures in accordance with directions given by the teachers
who constructed the respective tests. Thus each pupil had two scores: 
the one representing his knowledge of the text material as measured 
by his teacher’s (Teacher V’s) test, the other representing his knowl- 
edge of the same material as measured by Teacher I’s test. An ex- 
ample of the scores from the two tests given in the manner described 
is presented in Table I. 

The test constructed by Teacher V is called Test V and that con- 
structed by Teacher I is called Test I. The figures appearing in 
columns 2 and 3 of Table I are the percentage scores of the group 
of pupils on the two tests. For example, Pupil 1 made a score of 90 
on Test V and a score of 79 on Test I. An examination of the two 
columns of percentage figures in Table I gives a clear indication of 
the nature of the original data of this phase of the study. 
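
A reader who wishes to take a first look at paired scores of this kind might proceed roughly as in the Python sketch below. It uses the pairs from Table I, omitting the one pupil whose Test V score is illegible in this copy. The summary measures chosen here (the difference between the two scores for each pupil and a product-moment correlation) are an assumption made for illustration; they are not necessarily the treatment used later in this report.

```python
from math import sqrt

# (Test V score, Test I score) for the pupils in Table I.
# One pupil's Test V score is illegible in this copy, so that pair is omitted.
pairs = [
    (90, 79), (85, 71), (70, 87), (92, 75), (95, 79), (67, 71),
    (92, 75), (60, 67), (92, 71), (80, 71), (92, 79), (72, 75),
    (97, 67), (72, 66), (85, 62), (75, 71), (60, 75),
]

# Difference for each pupil: score on Test V minus score on Test I.
differences = [v - i for v, i in pairs]
mean_difference = sum(differences) / len(differences)
largest_gap = max(abs(d) for d in differences)


def pearson_r(points):
    """Product-moment correlation between the two members of each pair."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    return sxy / sqrt(sxx * syy)


r = pearson_r(pairs)
print(f"pupils compared: {len(pairs)}")
print(f"mean difference (V minus I): {mean_difference:.1f}")
print(f"largest single difference: {largest_gap}")
print(f"correlation between the two tests: {r:.2f}")
```

The differences show how far the two marks for one pupil may lie apart, while the correlation shows whether the two tests rank the pupils in a similar order.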

The tests. Since the investigation was conducted in an actual 
school situation, there were only two logical criteria for the construc- 
tion of the tests: (1) that they be the new-type or objective test, 
and (2) that they be adequate for the desired measurement. Any 
other requirements as to the nature of the tests would have tended 




TABLE I. PERCENTAGE SCORES ON TEST V AND TEST I

Pupil     Score on Test V     Score on Test I
  1             90                  79
  2             85                  71
  3             70                  87
  4             92                  75
  5             95                  79
  6             67                  71
  7             92                  75
  8             60                  67
  9         [illegible]             54
 10             92                  71
 11             80                  71
 12             92                  79
 13             72                  75
 14             97                  67
 15             72                  66
 16             85                  62
 17             75                  71
 18             60                  75





to formalize the study and to divorce it from the actual school situa- 
tion. 

The tests were composed of commonly used types of item. The 
first items from one randomly selected test in each subject field are 
presented here in order to give an indication of the types of item 
involved. 


Geography (fifth grade): When water changes to vapor we call the 
process (condensation, evaporation, moisture). 


Geography (fifth grade): In coming from the Congo to the United States
one would cross the .......... Ocean.


Geography (seventh grade): We are what we are largely because of 
where we are. (True-False) 


History (fifth grade): The story of our nation begins more than (100, 
300, 400, 800) years ago. 


History (seventh grade): The Greeks had a strong central government. 
(True-False)


History (high school): .... Tariff of Abomination, .... Compromise 
Tariff of 1833, .... Force Act, .... South Carolina Nullification 
Ordinance. (To be arranged in chronological sequence). 




The sample items presented show only a part of the types used, but 
this chance sample will give a relatively clear indication of the na- 
ture of the tests. 

The average length of the tests used in this study was approxi- 
mately forty items. Some of the tests were relatively short, but two 
facts should be kept in mind in this connection: first, the tests were 
considered adequate by competent teachers, and second, these shorter 
tests covered brief units of material.1

More detailed analysis of the tests. The purpose of this section 
is to provide detailed information concerning the nature of the 
tests used in this part of the study. These facts appear in Table II.*2
The most important facts in Table II* are summarized in Table III. 


TABLE III. SUMMARY OF FACTS WHICH PERTAIN TO IMPORTANT CHARACTERISTICS
OF TEACHER-MADE OBJECTIVE TESTS

Type of Item          Number of Items     Per Cent of All Items
Completion                  698                  28.6
Matching                    363                  14.9
Multiple-Choice             371                  15.2
True-False                  973                  39.9
Location                     18                    .7
Unclassified                 16                    .6

Total                      2439                  99.9

Number of tests covering designated text material ............ 42
Number of tests covering semester of work .................... 21

Total number of tests ........................................ 63


The classification of items used in Table III follows in general 
the classification presented by Ruch in his Objective or New-Type 
Examination.3 In the case of the four types most frequently used,
a number of variations appeared in the tests; that is, there were a 
number of sub-types of each of these main types. The data presented 
in Table III show that approximately 40 per cent of all the items 


1The relation between the length of a test and its reliability is not so sim- 
ple and direct as has often been implied in the measurement literature. Re- 
cently the fact that the reliability of a test may be increased by decreasing its 
length has been demonstrated. This means that the types of item which are 
added to or eliminated from a test will determine the resulting effect upon 
reliability and not the increasing or decreasing of the number of items per se. 
For an illuminating discussion of this problem see R. R. Willoughby, “The 
Concept of Reliability,” Psychological Review, XLII (March, 1935), 153-165. 
Also, W. A. Brownell, “On the Accuracy with Which Reliability May Be 
Measured by Calculating Test Halves,” Journal of Experimental Education, I 
(March, 1933), 204-215. 

2 An asterisk following the number of a table indicates that the table appears
in the Appendix of this report.

3 Ruch, op. cit., pp. 189-190.




were some form of the true-false type. The second most frequently 
used type of item was the completion;4 about one-fourth of all the
items are classified under this heading. Finally, approximately 15 
per cent of the items were of the matching type and about the same 
number of the multiple-choice type. Table II* reveals, also, that 
forty-four, or 69.8 per cent, of all the tests were composed of two 
or more types of item. 
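
The proportions cited in this paragraph can be checked against the item counts in Table III. The Python sketch below is offered only as a convenience to the reader. The two least frequent categories are combined under the label "other", which is a simplification made here and not a category of the original table.

```python
# Item counts by type, as given in Table III.
item_counts = {
    "completion": 698,
    "matching": 363,
    "multiple-choice": 371,
    "true-false": 973,
    "other": 18 + 16,  # the two least frequent categories combined
}

total_items = sum(item_counts.values())

# Share of all items falling under each type, to one decimal place.
percentages = {
    kind: round(100 * count / total_items, 1)
    for kind, count in item_counts.items()
}

# Tests composed of two or more types of item, out of all tests.
total_tests = 63
mixed_tests = 44
pct_mixed = round(100 * mixed_tests / total_tests, 1)

for kind, pct in percentages.items():
    print(f"{kind:16s} {item_counts[kind]:5d} {pct:5.1f}")
print(f"tests with two or more item types: {pct_mixed} per cent")
```

The computed shares agree with the per cent column of the table and with the 69.8 per cent quoted for tests of mixed type.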

Two attacks made. Data which bear upon the teacher-made test 
part of the investigation were gathered by two methods which dif- 
fered in minor respects. In the case of both methods the basic tech- 
nique already described was used. The chief difference consisted 
in the fact that in one attack the tests were constructed to cover an 
exactly designated unit of text material, while in the other the tests 
were made to cover the work of a semester. In the latter case the 
same texts were used by all teachers whose tests were paired, but 
no attempt was made to guarantee that the tests covered identical 
text material. Since the teachers whose tests were paired taught 
in the same or closely affiliated school systems and followed the 
same course of study, the differences, in reality, in text matter were 
relatively small. 


DETAILED DESCRIPTION AND ANALYSIS OF FIRST ATTACK 


Subject matter covered by tests. The teachers who taught a given 
subject at a given grade level5 were asked to agree upon a body of
text material on which the tests were to be based. The text matter 
varied from grade to grade and from subject to subject, but was 
identical for any given subject and any grade level. 

Instructions for making tests. As stated under the topic “Ex- 
ample of technique,” the teachers were instructed to make tests that 
they considered adequate measuring instruments for determining the 
achievement of their pupils with reference to the designated text 
material. The tests were to be designed to furnish an adequate 
basis for giving the pupils marks which accurately represented their 
achievement. All of the teachers habitually made use of the new- 
type test in their regular work; but in order to insure that the tests 
be of the type desired, each teacher was given a mimeographed list 
of examples of the more frequently used types of objective test. 

4 The term “completion” as used here is synonymous with Ruch’s “Recall

5 Fifth and sixth grade geography, fifth and sixth grade and high-school
history, and sixth grade health were the subjects and grade levels used in the
investigation.




However, the teachers understood that this list was suggestive and 
not limiting. 

In addition to the two restricting suggestions mentioned on page 
28, the following points were emphasized:

1. Make such directions for the test as you would consider ade- 
quate for an effective administration of your test. 

2. Use any type of item or as many types of item as you wish 
to use in making your test. 

3. Let the length, as well as all other characteristics, of the test 
be determined by your judgment as to the number of items necessary 
in order that the test be an adequate measuring instrument. 

4. Do not confer regarding the nature and content of your test 
with other teachers who are taking part in the investigation. 

Pairing teacher-made tests. Each test was mimeographed in ex- 
actly the form suggested by the teacher who constructed the test. In 
order that the tests might be easily identified each teacher was as- 
signed a code number which was placed on the first page of the 
test. Thus a test in history might have the code number IVH5 in 
the upper left-hand corner of the first page of the test, which identi- 
fied the test as having been made by teacher number IV in fifth grade 
history. 

Teachers whose tests were made to cover the same text matter 
were paired for administration of the tests. Since each group of 
pupils was to take two tests (the test made by the teacher of the 
group and a second test constructed by another teacher), a second 
test had to be chosen for each group of pupils. In order to avoid 
any tendency to pair tests which were especially similar or especially 
dissimilar in content or form, teachers were paired by the chance 
method, without regard to the nature of the tests which they con- 
structed. After the second test for a group had been thus selected, 
sufficient mimeographed copies of the two tests were placed in the 
hands of the teacher of the group for administration. 
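
The report does not describe the mechanics of the chance method. One way such a pairing might be carried out is sketched below in Python, purely as an assumption: for each teacher’s group, the second test is drawn at random from the tests of the other teachers who covered the same text matter. The teacher code numbers and the function name are hypothetical.

```python
import random


def pair_by_chance(teachers, rng):
    """For each teacher's group, draw at random another teacher's test."""
    second_test = {}
    for teacher in teachers:
        # A group never receives its own teacher's test as the second test.
        others = [t for t in teachers if t != teacher]
        second_test[teacher] = rng.choice(others)
    return second_test


# Hypothetical code numbers for teachers covering the same text matter.
teachers = ["I", "II", "III", "IV", "V"]

# A fixed seed is used only so that the illustration is repeatable.
assignment = pair_by_chance(teachers, random.Random(1937))

for teacher, other in assignment.items():
    print(f"Teacher {teacher}'s pupils take Test {teacher} and Test {other}")
```

Because the draw ignores the content and form of the tests, it cannot favor pairs that are especially similar or especially dissimilar, which is the purpose stated above.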

Directions for administering the tests. After the plan for ad- 
ministering the tests had been carefully explained during an inter- 
view, each teacher received a mimeographed set of directions de- 
signed to neutralize the effect of practice and fatigue in the pupils 
who took the tests. A copy of the directions follows: 


Directions for Giving History and Geography Tests 
Note: It is extremely important that every teacher follow the directions
exactly. It is only by following the directions carefully that the results
will be of value.




1. In order that the results on the two tests you are giving may be 
comparable it is necessary that the tests be given at the same time. If 
one test is given before the other there will be some practice effect or 
perhaps some confusion. You are requested, therefore, to give the tests 
in the following manner. Let us call your own test Test A, and the other 
test which you are giving Test B. Give every other row in your class- 
room Test A, and the remaining children Test B. Then during the sec- 
ond half of the period give Test A to the children who previously had 
taken Test B, and Test B to the children who had taken Test A. For 
example, children occupying rows 1, 3, 5, and 7 would take Test A while 
the children in rows 2, 4 and 6 are taking Test B. Then during the second 
half-period, children occupying rows 1, 3, 5 and 7 would take Test B 
while those occupying rows 2, 4, and 6 take Test A. This procedure, as 
you see, will tend to equalize practice effect or any confusion that a test 
might cause in a child’s mind. 

2. Write down any directions that you give to the pupils other than 
those appearing on the test, both for your own test and for the other 
test you give. Give such directions as you consider necessary but remem- 
ber to keep an accurate account of the directions that you give. (Use the 
attached sheet for reporting this information.)⁶


The pupils were allowed sufficient time to complete the tests. 
Since the pupils looked upon these tests as their regular periodic ex- 
amination, the motivation was relatively strong. The pupils were 
not acquainted with the fact that they were taking a test made by 
another teacher. In order that the situation might be as normal as 
possible, in each case the two tests were given by the teacher of 
the group that took the tests. 
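
The alternating-row plan in direction 1 amounts to a simple counterbalanced order. The sketch below restates it for a classroom of seven rows, the number used in the example in the directions; the function and variable names are illustrative only.

```python
def order_for_row(row_number):
    """Return the (first half-period, second half-period) tests for one row."""
    # Odd-numbered rows begin with Test A, even-numbered rows with Test B.
    return ("A", "B") if row_number % 2 == 1 else ("B", "A")


rows = range(1, 8)  # seven rows, as in the example given in the directions
schedule = {row: order_for_row(row) for row in rows}

first_half_a = [row for row, order in schedule.items() if order[0] == "A"]
first_half_b = [row for row, order in schedule.items() if order[0] == "B"]

for row, (first, second) in schedule.items():
    print(f"row {row}: Test {first} first, then Test {second}")
```

Every pupil takes both tests, and each test is taken first by roughly half the class, so any practice or fatigue effect falls about equally on Test A and Test B.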

Scoring of the tests. At the same time that he constructed his 
test, each teacher made out a key and a set of directions for scoring 
his test. The teachers scored their own tests; the second tests were 
scored by the investigator with the use of the proper key. Precau- 
tions were taken in order to avoid errors in scoring. 

Teachers, tests, pupils, and test papers involved. The scope of 
this part of the study can be best understood when one is acquainted 
with facts relative to the number of teachers, of tests, of pupils, and 
of test papers which were involved in gathering the data. Table IV 
is presented in order to supply these facts. 

Table IV shows that twenty-four teachers⁷ assisted in this study.
These teachers constructed a total of forty-two tests which were
administered to 2,255 pupils.⁸ Since each child took two tests, the data
of this phase of the investigation are based upon 4,510 individual test
papers.


TABLE IV. TEACHERS, TESTS, PUPILS, AND TEST PAPERS INVOLVED IN THE
STUDY OF TEACHER-MADE OBJECTIVE TESTS WHICH COVER
SHORTER UNITS OF MATERIAL

Classification        Frequency
Teachers                   24
Tests                      42
Pupils                  2,255*
Test papers             4,510

*When duplications are subtracted the number of pupils is 1,575.


⁶ Other directions of a routine nature which pertained to such matters as
date on which tests should be administered were included on the mimeographed
sheet of directions, but these points, not bearing upon the issue under
discussion, are not given here.

⁷ In a number of cases the same teacher made two tests; for example, a fifth
grade teacher would make a test in fifth grade history and in fifth grade
geography. Or in the case of those schools which use the platoon system, the
geography teacher would construct tests for fifth and sixth grade geography.
The number of teachers given above represents the number of individual
teachers who assisted in the study.

⁸ In many instances the same pupils took a geography test and a history
test. For example, a fifth grade pupil would take a fifth grade history test and
a fifth grade geography test. In calculating the number of pupils these pupils
were counted twice. The total number of pupils when such duplication is
eliminated is 1,575.


Variation of the first attack. The general techniques of the first
attack were altered somewhat in the case of two groups in high-school
history and in the case of the health group. The health tests were
given to a group of one hundred pupils. These tests were constructed
and administered in accordance with the directions already described
except that the paired tests were not given to the pupils of the
teachers who made the tests. This permitted significant comparisons of
the two tests as measuring instruments.

The same procedure was followed in the case of two pairings of
high school history tests. These four tests or two pairs of tests
may be identified in the subsequent chapters by the code numbers
ISHSV-IISHSV and IIISHSV-IVSHSV.


DETAILED DESCRIPTION AND ANALYSIS OF SECOND ATTACK


Unit for testing. As has been stated already, the unit over which
the tests were made in the second attack was a semester of work in
a given subject. Teachers were paired only when they used the
same text. The point has been made that since the texts were the
same, the school systems the same or closely related, and the
general courses of study were identical, the tests in reality covered very
similar subject matter. The tests were the regular semester final
examinations except that the two tests were given instead of the
one which would have been given normally. With the exception
mentioned in the preceding paragraph the procedures used in the
second attack were identical with those used in the first. The data
based upon semester tests were gathered in seventh grade geography,
seventh grade history, and high-school history.

Teachers, tests, pupils, and test papers involved. Table V shows 
the number of teachers, tests, pupils, and test papers which were in- 
volved in the second attack. 


TABLE V. TEACHERS, TESTS, PUPILS, AND TEST PAPERS INVOLVED IN STUDY OF
TEACHER-MADE OBJECTIVE TESTS WHICH COVER A SEMESTER OF WORK

Classification        Frequency
Teachers                   11
Tests                      21
Pupils                    878*
Test papers             1,756

*When duplications are subtracted the number of pupils is 516.


The data in Table V show the scope of the second attack upon 
the teacher-made test phase of the research. The table shows that 
eleven teachers constructed a total of twenty-one semester tests; 
that these tests were given to 878 pupils; and that since each pupil 
took two of the tests there was an aggregate of 1,756 test papers. 
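
The paper counts in Tables IV and V follow directly from the fact that each pupil took two tests. A minimal arithmetic check, using only the figures reported in those tables (the dictionary layout is illustrative):

```python
TESTS_PER_PUPIL = 2  # each pupil took two paired tests

# Pupil and test-paper counts as reported in Tables IV and V.
attacks = {
    "first attack (Table IV)": {"pupils": 2255, "papers": 4510},
    "second attack (Table V)": {"pupils": 878, "papers": 1756},
}

for name, counts in attacks.items():
    expected_papers = counts["pupils"] * TESTS_PER_PUPIL
    agrees = expected_papers == counts["papers"]
    print(f"{name}: {counts['pupils']} pupils x 2 = {expected_papers} (agrees: {agrees})")

total_papers = sum(counts["papers"] for counts in attacks.values())
print("test papers in both attacks:", total_papers)
```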


TYPES OF RESULTS SECURED 


The results of the part of this research which is based upon 
teacher-made tests are presented and interpreted in Chapters IV, V, 
and VI. Table I (see p. 29) was presented as an illustration of 
the basic technique used in this part of the investigation. The two 
sets of percentage scores for each child or each group may be con- 
sidered as the basic data. How should these data be analyzed in 
order to show most clearly the extent of disparity in the results from 
teacher-made objective tests? Three methods of analysis are em- 
ployed: correlation, percentile point disparity, and teacher-mark dis- 
parity. 

As an illustration of the types of analysis which are made, the 
facts presented in Table I are given again at this point and to these 
facts are added data basic to the analyses which are presented in 
Chapters IV, V, and VI. A sample set of these data appears as 
Table VI. The paired tests in this instance were Test IH5 and 
Test VH5. The tests were given to the pupils listed in column 1 
of Table VI. Column 2 shows the score that each child made on
Test VH5, and column 3 shows the corresponding score for Test
IH5.

a. The first analysis consists of correlating the two sets of 
scores in columns 2 and 3. The coefficient in this case is .239. 

b. The second type of analysis is made in terms of percentile 
rank disparity. The figures in column 4 show the percentile rank 
of each pupil on Test VH5, and those in column 5 reveal his cor- 
responding percentile rank for Test IH5. These facts permit the 
second type of analysis. For example, Pupil 1’s percentile point 
position on Test VH5 was 70 (i. e., 70 per cent of the pupils made 
scores lower than Pupil 1), whereas his percentile position on Test 
IH5 was 83. Thus there was a difference of 13 in Pupil 1’s per- 
centile rank on the two tests. The second analysis is made in terms 
of these differences. 

c. Finally, the letters in columns 6 and 7 represent the marks
which the child’s performance on the two tests would warrant
according to the system of marking used in the school system in which
the study was made. As an illustration, Pupil 1’s mark on Test
VH5 was A, and his mark on Test IH5 was C. This means that
the two marks differed to the extent of two mark intervals. These
facts permit an analysis in terms of teacher-mark disparity or
variation, the third type of analysis used. These analyses of results
together with some interpretation of their meaning appear in the next
three chapters.


TABLE VI. SAMPLE SET OF DATA ILLUSTRATING TYPES OF ANALYSES USED IN
THE STUDY OF TEACHER-MADE OBJECTIVE TESTS

Pupil   Score on   Score on   P. R. on   P. R. on   Mark on    Mark on
        Test VH5   Test IH5   Test VH5   Test IH5   Test VH5   Test IH5
 (1)      (2)        (3)        (4)        (5)        (6)        (7)
  1        90         79         70         83         A          C
  2        85         71         58         45         B          C
  3        70         87         25         96         C          B
  4        92         75         81         64         A          C
  5        95         79         93         83         A          C
  6        67         71         20         45         F          C
  7        92         75         81         64         A          C
  8        60         67         13         32         F          F
  9        37         54          1          4         F          F
 10        92         71         81         45         A          C
 11        80         71         49         45         B          C
 12        92         79         81         83         A          C
 13        72         75         35         64         C          C
 14        97         67         99         32         A          F
 15        72         66         35         27         C          F
 . . .
 40        85         62         58         18         B          F
 41        75         71         42         45         C          C
 42        60         75         13         64         F          C

N                                      42
Mean P. R. Difference                  26.6
Mean Mark Interval Difference          1.29
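
The three analyses can be sketched on the rows of Table VI that are printed in the text. Only eighteen of the forty-two pupils appear there, so the figures produced below will not reproduce the reported .239, 26.6, or 1.29; the sketch shows the method only. The function names are illustrative, and the mark scale passed to the last function is an assumption, since the source states only that A and C lie two intervals apart.

```python
from math import sqrt

# (pupil, score VH5, score IH5, percentile rank VH5, percentile rank IH5)
# for the eighteen rows printed in Table VI.
ROWS = [
    (1, 90, 79, 70, 83), (2, 85, 71, 58, 45), (3, 70, 87, 25, 96),
    (4, 92, 75, 81, 64), (5, 95, 79, 93, 83), (6, 67, 71, 20, 45),
    (7, 92, 75, 81, 64), (8, 60, 67, 13, 32), (9, 37, 54, 1, 4),
    (10, 92, 71, 81, 45), (11, 80, 71, 49, 45), (12, 92, 79, 81, 83),
    (13, 72, 75, 35, 64), (14, 97, 67, 99, 32), (15, 72, 66, 35, 27),
    (40, 85, 62, 58, 18), (41, 75, 71, 42, 45), (42, 60, 75, 13, 64),
]


def pearson_r(xs, ys):
    """Product-moment correlation between two equally long score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def mean_pr_difference(rows):
    """Mean absolute difference between the two percentile ranks."""
    return sum(abs(row[3] - row[4]) for row in rows) / len(rows)


def mark_interval_difference(mark_a, mark_b, scale):
    """Number of mark intervals separating two marks on an ordered scale."""
    return abs(scale.index(mark_a) - scale.index(mark_b))


r_sample = pearson_r([row[1] for row in ROWS], [row[2] for row in ROWS])
pr_sample = mean_pr_difference(ROWS)
# Assumed scale: only the A-B-C ordering is supported by the text.
pupil_1_marks = mark_interval_difference("A", "C", scale=["A", "B", "C", "F"])

print(f"r for the printed rows: {r_sample:.3f}")
print(f"mean P. R. difference for the printed rows: {pr_sample:.1f}")
print(f"Pupil 1 mark interval difference: {pupil_1_marks}")
```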


CHAPTER IV


CORRELATION ANALYSIS OF RESULTS FROM PAIRED INFORMAL
OR TEACHER-MADE OBJECTIVE TESTS


Purpose of chapter. In this chapter the data secured as described 
in Chapter III are analyzed by the use of the correlation technique. 
Each of sixty-eight groups of pupils was given tests which were con- 
structed by two different teachers to measure the same abilities, 
and the two sets of scores which resulted were correlated. The 
purpose of this chapter is to present and interpret the coefficients of 
correlation thus secured. 

The disparity between paired teacher-made tests is expressed in 
this study in three ways or in terms of three types of indexes. One 
index of the extent of disparity is the degree to which scores on 
paired tests are related as represented by correlation.¹ The
correlation coefficients presented in this chapter are offered, therefore, as
one index to the extent of variability in teacher-made objective test 
results. 


CORRELATION BETWEEN TESTS WHICH COVER IDENTICAL UNITS 
OF TEXT MATERIAL 


Fifth grade geography. In fifth grade geography paired tests 
were administered to ten groups of pupils. As a rule, for purposes 
of analysis, one teacher’s pupils were considered as a test group.²
The scores made by the pupils in each test group on one test were 
correlated with the scores made by the same pupils on the second 
test. The ten coefficients secured are presented in Table VII. 

Table VII shows that the mean correlation for the ten groups 
is .523. The highest correlation revealed is .71 and the lowest is .247. 

TABLE VII. FIFTH GRADE GEOGRAPHY: CORRELATION BETWEEN INFORMAL*
OBJECTIVE TESTS WHICH COVER IDENTICAL TEXT MATERIAL

Tests†              r        N
[illegible]       .710      33
[illegible]       .662      55
[illegible]       .659      86
[illegible]       .643      42
[illegible]       .568      99
[illegible]       .523      72
[illegible]       .452      65
[illegible]       .438      44
[illegible]       .344      28
[illegible]       .247      88

Mean              .523      61.2

*The terms “informal” and “teacher-made” are used as synonyms.
†Teacher XII, Geography, Grade 5 - Teacher V, Geography, Grade 5.


The fact should be kept in mind that the test groups used in
fifth and sixth grade geography and history included all the pupils
in these grades in a particular school. In cases where N is more
than forty-five the test group was composed of high and low ability
sections of pupils. The correlation coefficients were calculated from
the test scores of both sections taken together. This means that the
relationship would probably have been somewhat lower if the r’s
had been based upon the separate sections, for in such case the
spread of the scores would have been less.


¹ When applied to tests, coefficients of correlation secured in this manner are
frequently designated as coefficients of reliability. See pp. 69-70 for a brief
discussion of the meaning of correlation coefficients as applied to tests.

² In all cases, the results are presented in terms of test groups.

Sixth grade geography. Table VIII presents the same facts for 
sixth grade geography as were given for fifth grade geography in 
Table VII. 


TABLE VIII. SIXTH GRADE GEOGRAPHY: CORRELATION BETWEEN INFORMAL
OBJECTIVE TESTS WHICH COVER IDENTICAL TEXT MATERIAL

Tests               r        N
[illegible]       .770      37
[illegible]       .718      45
[illegible]       .621      78
[illegible]       .539      66
[illegible]       .395      67
[illegible]       .320      84
[illegible]       .197      71

Mean              .508      64





The range of coefficients for sixth grade geography is from .197
to .770 and the average correlation for all test groups is .508.
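
The summary figures for sixth grade geography can be recomputed from the seven coefficients and group sizes printed in Table VIII. The sketch below is a plain arithmetic check; the variable names are illustrative, and the mean r comes to about .5086, which the table gives as .508.

```python
# (r, N) for the seven test groups of Table VIII.
table_viii = [
    (0.770, 37), (0.718, 45), (0.621, 78), (0.539, 66),
    (0.395, 67), (0.320, 84), (0.197, 71),
]

coefficients = [r for r, _ in table_viii]
group_sizes = [n for _, n in table_viii]

mean_r = sum(coefficients) / len(coefficients)
mean_n = sum(group_sizes) / len(group_sizes)

print(f"range of r: {min(coefficients):.3f} to {max(coefficients):.3f}")
print(f"mean r: {mean_r:.4f}")
print(f"mean N: {mean_n:.1f}")
```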

Summary for identical unit tests in geography. In fifth grade 
geography and sixth grade geography a total of seventeen paired 
tests were given and the scores for each test group correlated. The 
mean correlation for these groups is approximately .50. 

Fifth grade history. Eight groups of children were given paired 
tests in fifth grade history. When the test scores were correlated, 
the coefficients which appear in Table IX were found. 


TABLE IX. FIFTH GRADE HISTORY: CORRELATION BETWEEN INFORMAL
OBJECTIVE TESTS WHICH COVER IDENTICAL TEXT MATERIAL

Tests               r             N
[illegible]       .684           54
[illegible]       .569           32
[illegible]       .550           44
[illegible]       [illegible]    37
[illegible]       .438           28
[illegible]       .382           75
[illegible]       [illegible]    64
VH5-IH5           .239           42

Mean              .468           47


The range of the coefficients is approximately the same as in 
the case of fifth and sixth grade geography. As is shown in Table 
IX the mean correlation for fifth grade history is .468. 

Sixth grade history. The relationship between paired tests in
sixth grade history is shown in Table X. 

Table X reveals a mean correlation of .535 for the nine test 
groups. The highest coefficient is .661 and the lowest is .377. 


TABLE X. SIXTH GRADE HISTORY: CORRELATION BETWEEN INFORMAL
OBJECTIVE TESTS WHICH COVER IDENTICAL TEXT MATERIAL

Tests               r        N
[illegible]       .661      31
[illegible]       .633      37
[illegible]       .606      65
[illegible]       .591      45
[illegible]       .550      41
[illegible]       .537      74
[illegible]       .457      44
[illegible]       .407      41
[illegible]       .377      64

Mean              .535      49.1


High-school history. When the scores from paired tests in high- 
school history were correlated, the coefficients presented in Table XI 
were found. 

TABLE XI. HIGH-SCHOOL HISTORY: CORRELATION BETWEEN INFORMAL
OBJECTIVE TESTS WHICH COVER IDENTICAL TEXT MATERIAL

Tests               r             N
[illegible]       .837           32
[illegible]       .820           32
[illegible]       [illegible]    33
[illegible]       .725           34
[illegible]       .669           32
[illegible]       .486           34

Mean              .711           32.8


The relationship between the paired tests which covered a short
unit in high-school history was the highest found in the
teacher-made test phase of the study. The mean r in this case is .711. The
relatively high degree of relationship shown in Table XI may, in
part at least, be accounted for by two facts: first, of the four
teachers who constructed the tests used, three taught in the same high
school and had frequent conferences concerning testing procedures.
Although the teachers did not confer when they were constructing
the tests used in this study, the unusually close contact between the
teachers very likely decreased the variability which results from the
selection of items for a test. In the second place, the testing unit
was relatively short, whereas the tests were long and specifically
factual in nature.

Variation in technique for high-school history. As a further
check upon the amount of disparity between objective tests in
high-school history the four high-school history tests were paired and
given to groups of pupils who were not the pupils of either of the
teachers who constructed the tests used.³ Tests IHSHV1 and
IIHSHV1 were considered as one pair and were given to a group
of thirty-six pupils. Tests IIIHSHV2 and IVHSHV2 constituted
a second pairing and were administered to forty-four pupils. When
the scores were correlated the following coefficients were obtained:


Tests                     r
IHSHV1-IIHSHV1          .654
IIIHSHV2-IVHSHV2        .341


Reference to Table XI shows that the first of the two pairings just 
listed correlated .725 when the two tests were administered accord- 
ing to the regular procedures as compared with .654 when the varia- 
tion method was used. Even more striking is the fact that the 
second variation pairing showed a correlation of .820 (Table XI) 
when administered according to the regular procedure, as com- 
pared with a correlation of .341 when the method was varied. These 
facts seem to constitute some evidence that the relatively high re- 
lationship revealed in Table XI was due in some degree to factors 
other than those in the tests themselves. 

Sixth grade health tests. A pair of health tests was administered
to a group of one hundred pupils in the sixth grade. The
correlation between the two sets of scores was .402.


³ See p. 34 for a detailed description of the procedure.

Summary for all identical text material tests. Table XII pre- 
sents a summary of the forty-three coefficients of correlation which 
resulted from the correlating of paired tests constructed to cover 
identical units of text matter. 

Table XII reveals that paired tests were administered in history 
in the fifth and sixth grades and in high school; in geography in 
the fifth and sixth grades; and in health at the sixth grade level. 
In column 3 the mean r for each subject-grade group is given. Of 
the eight mean coefficients presented, six are below .55. The mean 
of the mean coefficients as shown in Table XII is .51. This sum- 
mary mean is based upon the correlation between the scores from 
forty-three test groups which involve a total of 2,255 pairs of indi- 
vidual test papers. 


TABLE XII. SUMMARY OF CORRELATION BETWEEN TESTS WHICH COVER
IDENTICAL TEXT MATERIAL

                                                   Number of
Subject         Grade     Mean r     Mean N     Test Groups
Geography         5        .523        61            10
Geography         6        .508        64             7
History           5        .468        47             8
History           6        .535        49             9
History         S.H.S.     .711        32             6
History V1      S.H.S.     .654        44             1
History V2      S.H.S.     .341        36             1
Health            6        .402       100             1

Summary mean               .51         54
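
The counts behind this summary can be checked against the group sizes printed in Tables VII through XI and in Table XII. The sketch below assumes those printed N values are read correctly from the scan; the dictionary labels are illustrative. The unweighted mean of the eight mean coefficients comes to about .518, which the text reports as .51.

```python
# Group sizes (N) as printed in Tables VII-XI and, for the last three
# entries, in Table XII.
group_sizes = {
    "Geography, grade 5": [33, 55, 86, 42, 99, 72, 65, 44, 28, 88],
    "Geography, grade 6": [37, 45, 78, 66, 67, 84, 71],
    "History, grade 5": [54, 32, 44, 37, 28, 75, 64, 42],
    "History, grade 6": [31, 37, 65, 45, 41, 74, 44, 41, 64],
    "History, S.H.S.": [32, 32, 33, 34, 32, 34],
    "History V1, S.H.S.": [44],
    "History V2, S.H.S.": [36],
    "Health, grade 6": [100],
}

# Mean r for each subject-grade group, from Table XII.
mean_coefficients = [0.523, 0.508, 0.468, 0.535, 0.711, 0.654, 0.341, 0.402]

total_groups = sum(len(sizes) for sizes in group_sizes.values())
total_pairs = sum(sum(sizes) for sizes in group_sizes.values())
mean_of_means = sum(mean_coefficients) / len(mean_coefficients)

print("test groups:", total_groups)
print("pairs of test papers:", total_pairs)
print(f"mean of the eight mean coefficients: {mean_of_means:.4f}")
```

The totals of forty-three test groups and 2,255 pairs of papers agree with the figures stated in the text.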





CORRELATION BETWEEN TESTS CONSTRUCTED TO MEASURE A 
SEMESTER OF WORK 


The results already presented show that the relationship be- 
tween paired objective tests which were constructed to measure 
identical bodies of text matter expressed in terms of a coefficient of 
correlation is approximately .50. The question arose as to what 
the relation would be if semester examinations were used instead 
of tests constructed to cover shorter units of work. In order to il- 
luminate this problem, paired semester tests were administered to 
twenty-five test groups, and the two sets of scores secured for each 
group were correlated. 

Seventh grade geography. Ten test groups were given paired
semester tests in seventh grade geography. The correlations found
are shown in Table XIII.

The range of the coefficients is relatively large in this case. The
highest correlation is approximately .85, whereas the lowest shows
a slight negative relationship or a coefficient of -.21. The mean
coefficient is .413.


TABLE XIII. SEVENTH GRADE GEOGRAPHY: CORRELATION BETWEEN INFORMAL
OBJECTIVE TESTS WHICH COVER A SEMESTER OF WORK

Tests               r        N
[illegible]       .845      39
[illegible]       .657      43
[illegible]       .632      40
[illegible]       .594      40
[illegible]       .581      40
[illegible]       .393      25
[illegible]       .284      37
[illegible]       .246      34
[illegible]       .117      28
[illegible]      -.212      36

Mean              .413      36.2





Seventh grade history. Table XIV presents the corresponding 
facts for seventh grade history. 


TABLE XIV. SEVENTH GRADE HISTORY: CORRELATION BETWEEN INFORMAL
OBJECTIVE TESTS WHICH COVER A SEMESTER OF WORK

Tests               r             N
[illegible]       .590           34
[illegible]       .643           27
[illegible]       .642           32
[illegible]       .616           40
[illegible]       .600           28
[illegible]       .589           39
[illegible]       .465           36
[illegible]       .398           31
[illegible]       [illegible]    44
[illegible]       .309           40
[illegible]       .078           36
[illegible]       .053           35
[illegible]      -.092           36

Mean              .401           35.2





Tests were administered to thirteen test groups in seventh grade 
history. The mean of the thirteen coefficients is .401 and the me- 
dian is .465. 

High-school history. Only two pairs of semester tests were given
in high-school history, but the two coefficients secured are of
interest, because the fact that the relationship is quite similar in magnitude
to that found in other phases of the study supports the general results.
The four semester tests were paired as follows:


1. IHSB-IIHSB
2. IHSF-IIHSF


Correlation of the scores from the first pair of tests (No. 1)
revealed an r of .276, and of the second pair of tests (No. 2) an r
of .420. The mean coefficient for the high-school semester tests is
approximately .35. This relatively low relationship becomes more
interesting when one recalls that the four tests used were perhaps the
most detailed and carefully constructed of the semester tests used
in this study. Reference to Table II* reveals the fact that the four
tests were composed of the following numbers of items:


Test        No. of items
IHSB            105
IIHSB            65
IHSF            100
IIHSF            80


Thus the lack of relationship revealed can hardly be attributed to 
the length of the tests. 

Summary for all semester tests. Paired semester tests were given 
to twenty-five test groups, ten in seventh grade geography, thirteen 
in seventh grade history and two in high-school history. The fol- 
lowing mean r’s were obtained: 


Subject and grade              Mean r
Seventh grade geography         .413
Seventh grade history           .401
High-school history             .348


The mean of these three means is .387. That is, the average rela- 
tionship between twenty-five paired semester tests is approximately 
represented by a coefficient of .40. It is of interest to recall at this 
point that the comparable mean for the shorter unit tests was ap- 
proximately .50. 
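
The mean of the three means gives each subject-grade group equal weight, although the groups contribute ten, thirteen, and two coefficients respectively. The sketch below recomputes the reported .387 and, as an additional comparison not made in the text, the mean weighted by the number of test groups. Variable names are illustrative.

```python
# (subject and grade, mean r, number of test groups) for the semester tests.
semester_means = [
    ("Seventh grade geography", 0.413, 10),
    ("Seventh grade history", 0.401, 13),
    ("High-school history", 0.348, 2),
]

# Mean of means, as in the text: each subject-grade group counts once.
unweighted = sum(r for _, r, _ in semester_means) / len(semester_means)

# Alternative: weight each mean by the number of coefficients behind it.
total_groups = sum(k for _, _, k in semester_means)
weighted = sum(r * k for _, r, k in semester_means) / total_groups

print(f"mean of the three means: {unweighted:.3f}")
print(f"mean weighted by test groups: {weighted:.3f}")
```

Both figures are consistent with the text's statement that the average relationship is approximately .40.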


CORRELATION BETWEEN PAIRED TESTS IN GEOGRAPHY 


Table XV shows the correlation between all (semester and shorter 
unit) tests in geography. 

Paired tests were given to groups of pupils in fifth, sixth, and 
seventh grade geography. Table XV shows the mean correlation 
for each of these grades and the mean of these means. The sum- 
mary mean which is based upon twenty-seven coefficients is .481. 


TABLE XV. MEAN CORRELATION FOR ALL PAIRED TESTS IN GEOGRAPHY

Subject           Grade     Mean r     Mean N
Geography           5        .523       61.2
Geography           6        .508       64.0
Geography           7        .413       36.2

Mean of means                .481       53.8


CORRELATION BETWEEN PAIRED TESTS IN HISTORY 


How does the relationship between the geography tests compare 
with that between history tests? The facts presented in Table XVI 
answer this question. 


TABLE XVI. MEAN CORRELATION FOR ALL PAIRED TESTS IN HISTORY

Subject           Grade     Mean r     Mean N
History             5        .468       47.0
History             6        .535       49.1
History             7        .401       35.2
History           S.H.S.     .515       33.9

Mean of means                .479       41.3


The mean of the means for all the paired tests in history (forty 
test groups) is .479. The fact that this figure is strikingly similar 
in magnitude to that obtained for all geography tests should be 
noted, for this similarity of the two summary means is evidence 
that the relationship found approximates the relationship which ac- 
tually exists between informal objective tests. 


RESULTS FROM ALL TESTS 


Sixty-eight pairs of tests were given in connection with the study 
of teacher-made objective tests, twenty-seven pairs of tests in geog- 
raphy, forty pairs in history, and one pair in health. Table XVII 
shows the mean correlation for each subject-grade group and the 
average of these means. 

Table XVII indicates that the mean of the eight means which 
are based upon sixty-eight coefficients is 470.4 The most convincing 
feature of Table XVII is the similarity in magnitude of the eight 
mean coefficients. No mean coefficient is less than .40 and none is 
above .54. This consistency of the mean r’s which are shown in 

*The writer is aware that there has been criticism of averaging coefficients 
of correlation. However, one should keep in mind that the purpose of com- 


puting this mean is to indicate a tendency in the extent of relationship which is 
found when a number of correlations are considered. 


46 = Variability in Results from New-T ype Achievement Tests 


Taste XVII. SUMMARY OF CORRELATIONS BETWEEN Patrep Tests For ALL 
SuBJECTS AND ALL GRADE LEVELS 








Number Test 

Subject Grade Mean r Mean N Groups 
GeORTADDY.< sle.csateiviren eye's rs 5 .523 61.2 10 
Geomranbycteaitesaseshen es 6 -508 64.0 7 
Geography dideicicrtccicinns 7 413 36.2 10 
Historyy. true hauniencitue ten 5 -468 47.0 8 
PAs toys carercieiicurastsisrac : 6 535 49.1 9 
UIRtOry.n. sce cease masters 7 401 452 13 
FRA TOLY otste lc enimteictore 6 steerer iste S.H.S. 515 33.9 10 
Healthtes sic scfeuteswistesseie 6 .402 100.0 1 
Mean‘ofimeansiss: cs cetscieekeebice hee teenoe -470 533 


Table XVII seems to warrant the following tentative generaliza- 
tion: The correlation between teacher-made objective tests con- 
structed by different teachers to measure pupil acquaintance with 
the same subject matter and scored objectively is approximately .50. 
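
The mean of means reported in Table XVII can be checked directly. The sketch below is an illustration only and is not part of the original study. It averages the eight mean coefficients listed in the table and, as an alternative that the monograph does not use, averages them through Fisher's z transformation, one common answer to the criticism of averaging coefficients. The variable names are hypothetical.

```python
import math
from statistics import mean

# Mean r for each subject-grade group, as listed in Table XVII.
mean_rs = [0.523, 0.508, 0.413, 0.468, 0.535, 0.401, 0.515, 0.402]

# Simple arithmetic mean of the eight means (the method described in the text).
arithmetic_mean = mean(mean_rs)

# Alternative, not used in the monograph: average on Fisher's z scale
# (z = atanh(r)) and transform the mean z back to a correlation.
fisher_mean = math.tanh(mean(math.atanh(r) for r in mean_rs))

print(f"arithmetic mean of r: {arithmetic_mean:.3f}")
print(f"Fisher z mean of r:   {fisher_mean:.3f}")
```

Both procedures give a value near .47, so the choice between them would not alter the tentative generalization drawn from the table.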


TRADITIONAL AND OBJECTIVE TYPE TESTS COMPARED 


A study of traditional (essay) type tests, very similar to the study 
of objective type tests reported in this chapter, was made by Monroe
and Souders.⁵ The availability of these data makes possible
an interesting comparison. Table XVIII presents this comparison. 
The first half of the table is a summary distribution of the results 
from Monroe’s and Souders’ study of paired traditional type tests. 
Only those data directly comparable with the findings of the pres- 
ent study (of new-type tests) are presented. The second half of 
Table XVIII is a summary distribution of the coefficients found 
when scores from paired objective tests were correlated in the pres- 
ent study. 

The median correlation for the traditional type tests (thirty- 
four coefficients) is .65, whereas the corresponding median for the 
objective type tests (sixty-eight coefficients) is .54. 

The difference between the two medians is not great, and there- 
fore it is not wise perhaps to conclude from these facts that ob- 
jective tests show more variation than traditional type tests; how- 
ever, the data presented in Table XVIII warrant the conclusion 
that objective tests as used by classroom teachers are probably no 
more free from factors which cause variation as measured by the 
methods used in this study than are traditional type tests. If one 

⁵ Monroe and Souders, op. cit., p. 77. Monroe’s and Souders’ data are
quoted by Ruch, op. cit., chap. iii, in connection with his discussion of
“objections to traditional type tests.”




TABLE XVIII. COMPARISON OF DISTRIBUTIONS OF CORRELATIONS FOR TRADITIONAL
TESTS AND FOR OBJECTIVE OR NEW-TYPE TESTS

               Traditional Type     Objective Type
Size of r      Test*: Frequency     Test: Frequency
[frequency rows illegible in this copy]
N                    34                   68
Median              .65                  .54





*Data adapted from Monroe and Souders, op. cit., pp. 32, 33. 


may judge by the literature, the general opinion of writers on educational
measurements has been contrary to this conclusion.⁶ Although
no facts have been available on this point, the assumption that the 
new-type or objective test as a measuring instrument is less sub- 
jective or variable in general than the essay type test seems to have 
been widely accepted. The findings of this investigation are at least 
a beginning toward the substitution of relevant facts for unsub- 
stantiated opinion. 

Further, the facts given in Table XVIII seem to indicate that 
objectivity in scoring (the distinguishing mark of the objective test) 
as a feature of testing does not decrease significantly the disparity 
in the results from such tests. That is to say, objectivity in scoring 
seems to be a relatively unimportant factor in determining the ex- 
tent of variation which grows out of the measurement situation as 
a whole. 


⁶ See E. V. Pullias, “A Study of Current Opinion Concerning Objective
Tests,” Educational Method, XVI (April, 1937), 348-356.


CHAPTER V 


PERCENTILE POINT ANALYSIS OF RESULTS FROM PAIRED
INFORMAL OBJECTIVE TESTS 


Purposes of chapter. In Chapter IV the data based on paired 
teacher-made objective tests were analyzed by the correlation tech- 
nique. In this chapter the same basic data are analyzed by what may 
be called the percentile rank disparity technique. This method has 
been described and illustrated on p. 36. Briefly stated, the pro- 
cedure consists of comparing the percentile rank position of a pu- 
pil on one test with his corresponding position on a second test 
when both tests were constructed to measure knowledge of identical 
material. The purpose of the percentile rank analysis may be clari- 
fied by illustration. If a group of children were weighed on two dif- 
ferent scales and percentile rankings determined for each series 
of weighings, one would expect relatively little disparity between 
the two rankings. The child who ranked heaviest as weighed by 
one set of scales would, within limits, rank heaviest as weighed 
by the second set of scales; therefore, the differences in percentile 
ranks would approach zero. If objective tests are used as measur- 
ing instruments what will be the extent of the percentile rank dif- 
ferences? To answer the question just stated is the central pur- 
pose of this chapter. 
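
The procedure just described can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: the percentile rank convention (per cent of scores below, plus half of those tied) is an assumption and may differ from the one the study used, and the function names and pupil scores are hypothetical.

```python
def percentile_ranks(scores):
    """Percentile rank of each score within its own group.

    Assumption: rank = 100 * (number below + half the number tied) / n.
    The monograph's own convention (illustrated on its p. 36) may differ.
    """
    n = len(scores)
    ranks = []
    for s in scores:
        below = sum(1 for t in scores if t < s)
        tied = sum(1 for t in scores if t == s)
        ranks.append(100.0 * (below + 0.5 * tied) / n)
    return ranks


def rank_disparities(test_a, test_b):
    """Absolute percentile-rank difference for each pupil on two paired tests."""
    ranks_a = percentile_ranks(test_a)
    ranks_b = percentile_ranks(test_b)
    return [abs(a - b) for a, b in zip(ranks_a, ranks_b)]


# Hypothetical raw scores for five pupils on two tests of the same material.
test_a = [12, 25, 31, 40, 47]
test_b = [30, 18, 35, 22, 44]

disparities = rank_disparities(test_a, test_b)
mean_disparity = sum(disparities) / len(disparities)
print(disparities, mean_disparity)
```

If the two tests ranked every pupil alike, as two sets of scales would be expected to do, every disparity would be zero; calling `rank_disparities(test_a, test_a)` shows this limiting case.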


TESTS WHICH COVER IDENTICAL UNITS OF TEXT MATTER 


Fifth grade geography. Paired tests were given to ten groups of 
pupils in fifth grade geography. In order to illustrate the technique 
used the percentile rank disparity found between these tests is shown 
in Table XIX. 

TABLE XIX. FIFTH GRADE GEOGRAPHY—DISPARITY IN TERMS OF PERCENTILE
POINTS BETWEEN INFORMAL OBJECTIVE TESTS WHICH COVER
IDENTICAL TEXT MATERIAL

                                       Percentage     Cumulative
Percentile   Frequency   Cumulative    Distribution   Percentage
Points       All Tests   Frequency     All Tests      Distribution
95-              0            0           0              0
90-              0            0           0              0
85-              0            0           0              0
80-              1            1           0.2            0.2
75-              1            2           0.2            0.4
70-              6            8           1.0            1.4
65-              8           16           1.3            2.7
60-             11           27           1.8            4.5
55-              9           36           1.5            6.0
50-             17           53           2.8            8.8
45-             19           72           3.1           11.9
40-             20           92           3.3           15.2
35-             33          125           5.4           20.6
30-             36          161           5.9           26.5
25-             54          215           8.8           35.3
20-             62          277          10.1           45.4
15-             73          350          11.9           57.3
10-             71          421          11.6           68.9
5-              99          520          16.1           85.0
0-              92          612          15.0          100.0
N              612                      100.0
M             21.5

One should remember when reading Table XIX that the step
intervals are percentile rank differences. For example, the 1 in the
frequency column opposite 80- indicates that the difference between
this pupil’s percentile ranks on the two tests which he took was
approximately 82.5 (midway in the interval, 80-84). He had a
percentile rank of ten on one test and ninety-one on another. The
cumulative frequency column shows the cumulated frequencies reading
downward on the step interval scale. For example, sixteen of
the pupils were 65 or more percentile points apart on the paired tests.
The cumulative percentage distribution is calculated in the same general
manner as the cumulative frequency column. As an illustration,
in the case of 8.8 per cent of the pupils their percentile ranks showed
a disparity of 50 or more percentile points.

Table XIX shows that the mean percentile rank disparity for the 
ten test groups is 21.5. One hundred and twenty-five pupils, or ap- 
proximately 20 per cent of the group, showed a percentile rank 
difference of 35 or greater. More than one-third (35 per cent) of 
the group revealed a minimum rank difference of 25 percentile points. 
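
The cumulative columns of Table XIX can be rebuilt from the frequency column alone. The sketch below is an illustration, not the author's worksheet; the function and variable names are hypothetical, and the frequencies are those read from the table, highest interval first.

```python
# Lower limits of the step intervals, 95-99 first and 0-4 last.
lower_limits = list(range(95, -1, -5))

# Frequencies from Table XIX in the same order.
frequencies = [0, 0, 0, 1, 1, 6, 8, 11, 9, 17,
               19, 20, 33, 36, 54, 62, 73, 71, 99, 92]


def cumulate_downward(freqs):
    """Running totals, starting from the largest disparity interval."""
    totals, running = [], 0
    for f in freqs:
        running += f
        totals.append(running)
    return totals


cumulative = cumulate_downward(frequencies)
n_pupils = cumulative[-1]


def pupils_at_or_above(points):
    """Pupils whose disparity is `points` or more (points must be an interval limit)."""
    return cumulative[lower_limits.index(points)]


for points in (65, 50, 35, 25):
    count = pupils_at_or_above(points)
    print(f"{points} or more points: {count} pupils, "
          f"{100 * count / n_pupils:.1f} per cent")
```

Percentages computed in this way may differ by a tenth or two from the printed cumulative percentage column, which appears to have been formed by adding the rounded percentages of the separate intervals.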

Other identical unit tests. Data¹ similar in nature to those presented
in the foregoing section were obtained for sixth grade geography,
fifth and sixth grade history, high-school history, and sixth
grade health. The consistency of the extent of percentile rank disparity
is of interest. One will recall that approximately one-third
(35 per cent) of the pupils in fifth grade geography were 25 or more
percentile points apart on the two tests. The corresponding figures
for the remaining subject-grade groups are as follows: sixth grade
geography, 37 per cent; fifth grade history, 39 per cent; sixth grade
history, 36 per cent; high-school history, 29 per cent; sixth grade
health, 42 per cent.

¹ Complete data for each subject is not presented because of space limitations.
Anyone desiring such facts should correspond with the author.

The consistency of the distributions is shown even more clearly 
in Fig. I. Note how nearly the curve for each subject-grade level
represented in Fig. I approximates at all points the curves for the
other subject-grades. As an example, in the case of all subject-grade 
levels approximately 50 per cent of the pupils showed a percentile 
point disparity of 20 or more. 


FIGURE I. PERCENTILE CURVE FOR IDENTICAL UNIT TESTS
[Figure: one curve for each subject-grade group; axes labeled DIFFERENCE IN PERCENTILE RANK and PER CENT]

Summary for all identical text material tests. The percentile 
rank disparity revealed when forty-three groups of pupils (2,255 in- 
dividuals) took paired tests constructed to cover identical text mate- 
rial is shown in Tables XX and XXI. 

TABLE XX. SUMMARY DISTRIBUTION—PERCENTILE POINT DISPARITY FOR ALL
TESTS WHICH COVER IDENTICAL TEXT MATERIAL

                                       Percentage     Cumulative
Percentile   Frequency   Cumulative    Distribution   Percentage
Points       All Tests   Frequency     All Tests      Distribution
95-              1            1           0.04           0.04
90-              1            2           0.04           0.08
85-              3            5           0.13           0.21
80-              3            8           0.13           0.34
75-              8           16           0.35           0.69
70-             16           32           0.71           1.40
65-             28           60           1.24           2.64
60-             45          105           2.00           4.64
55-             36          141           1.60           6.24
50-             64          205           2.84           9.08
45-             70          275           3.10          12.18
40-             99          374           4.39          16.57
35-            126          500           5.59          22.16
30-            144          644           6.39          28.55
25-            170          814           7.54          36.09
20-            220         1034           9.76          45.85
15-            266         1300          11.80          57.65
10-            249         1549          11.04          68.69
5-             369         1918          16.36          85.05
0-             337         2255          14.95         100.00
N             2255                      100.00
M             22.2

Table XX presents a composite distribution based upon the forty-three
group distributions. The range of percentile rank disparity as
shown in Table XX is from 0 to 99. There are cases in every interval
from 0-4 to 95-99. The mean disparity is 22.2. Table XX
shows that 205 pupils, or approximately 10 per cent of the group,
had percentile rankings on one of the tests taken which varied 50
or more percentile points from their rankings on the second test.

Five hundred pupils, or somewhat more than one-fifth of the 
group, showed a disparity of 35 or more percentile points; and dis- 
tinctly more than one-third of the total group revealed a disparity 
ranging from 25 percentile points upward. 


TABLE XXI. SUMMARY OF MEANS—PERCENTILE POINT DISPARITY FOR ALL
TESTS WHICH COVER IDENTICAL TEXT MATERIAL

                            Mean Percentile               Number of
Subject           Grade     Point Disparity    Mean N     Test Groups
Geography         5              21.5           61.2          10
Geography         6              22.9           64.0           7
History           5              23.0           47.0           8
History           6              21.6           49.1           9
History           SHS            17.7           32.8           6
History           SHS            23.3           40.0           2
Health            6              25.8          100.0           1







Table XXI is a more general summary of the facts presented in 
Table XX. The mean percentile rank disparity for each subject and 
grade level is shown. The greatest mean disparity found was 25.8 
(health), and the least was 17.7 (high-school history). The remaining 
five means are very similar in size, clustering very closely around the 
mean of the means. The consistency in magnitude of these means 
seems to indicate that they are relatively reliable, that is, that con- 
siderable confidence may be put in a generalization based upon these 
averages. The summary (based upon forty-three test groups) is 
22.2 percentile points. 
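
A summary mean of this kind depends somewhat on how the group means are combined. The sketch below is an illustration and not the author's computation: it takes the seven rows of Table XXI and compares the simple mean of means with a mean weighted by the approximate number of pupils behind each row (mean N multiplied by the number of test groups). The weighting scheme and the variable names are assumptions introduced here.

```python
# (mean disparity, mean N per group, number of test groups) from Table XXI.
rows = [
    (21.5, 61.2, 10),
    (22.9, 64.0, 7),
    (23.0, 47.0, 8),
    (21.6, 49.1, 9),
    (17.7, 32.8, 6),
    (23.3, 40.0, 2),
    (25.8, 100.0, 1),
]

# Simple mean of the seven means, each row counting equally.
unweighted = sum(m for m, _, _ in rows) / len(rows)

# Approximate pupils per row: mean N times number of groups (an assumption).
pupils = [n * g for _, n, g in rows]
total_pupils = sum(pupils)

# Mean weighted by the approximate number of pupils in each row.
weighted = sum(m * p for (m, _, _), p in zip(rows, pupils)) / total_pupils

print(f"unweighted mean of means: {unweighted:.2f}")
print(f"approximate total pupils: {total_pupils:.0f}")
print(f"pupil-weighted mean:      {weighted:.2f}")
```

Either figure falls within a fraction of a percentile point of the 22.2 reported, and the approximate pupil total agrees closely with the 2,255 individuals stated in the text, so the generalization does not appear to hinge on the method of averaging.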


TESTS CONSTRUCTED TO MEASURE A SEMESTER OF WORK 


The findings reported in the foregoing part of this chapter show
that the mean amount of disparity in terms of percentile rank differences
between shorter unit tests is approximately 22. Since the
purpose of the investigation was to determine the amount of disparity
between objective tests as they are generally used in school
situations, the next step consisted of extending the study to semester
examinations.

FIGURE II. PERCENTILE CURVE FOR SEMESTER TESTS
[Figure: axes labeled DIFFERENCE IN PERCENTILE RANK and PER CENT]

How does the disparity between paired semester tests compare 
with that shown to exist between tests which cover shorter units of 
work? In order to answer this question, paired semester tests were 
administered to twenty-five test groups (878 individuals) and the 
amount of percentile rank disparity was calculated. 

Seventh grade geography, seventh grade history, and high-school
history. The percentile rank disparity for groups of pupils in seventh
grade geography, in seventh grade history, and in high-school
history was determined. These distributions are very similar to
those for shorter unit tests. On the whole the disparity was slightly
greater in the case of semester tests. The consistency of the semester
test distributions is shown in Fig. II. The three curves tend to be
very similar. As an illustration, the point 20 on the percentile rank
difference scale in each case corresponds approximately to the point
50 on the percentage scale; that is, in each of the three groups represented
by the curves in Fig. II, about 50 per cent of the frequencies
appear at or above the point 20 on the percentile rank difference
scale. (See Fig. I for consistency between the two sets of curves.)

Summary for all semester tests. Summaries of the disparity be- 
tween all paired semester tests are presented in Tables XXII and 
XXIII. Table XXII is a summary distribution based upon twenty- 
five distributions for smaller groups. 

Taken as a whole, the semester examinations revealed very nearly 
the same amount of disparity as the tests which covered shorter units 
of work. The mean for the semester tests is 23.9, as compared with 
22.2 for the shorter unit tests. The distribution of frequencies is 
very similar in both cases. (Compare cumulative percentage dis- 
tributions in Table XX and Table XXII.) 

Table XXIII gives the means for the three groups of semester 
tests. 

The chief purpose of bringing these means together is to empha- 
size their consistency. Reference to Table XXIII indicates that the 
range is slightly less than 3 percentile points. The data gathered 
from the study of semester tests seem to warrant the conclusion that 
these tests show practically the same amount of disparity as tests 
constructed to cover shorter units of work. 




TABLE XXII. SUMMARY DISTRIBUTION—PERCENTILE POINT DISPARITY FOR
ALL TESTS WHICH COVER A SEMESTER OF WORK

                                       Percentage     Cumulative
Percentile   Frequency   Cumulative    Distribution   Percentage
Points       All Tests   Frequency     All Tests      Distribution
95-              0            0           0              0
90-              2            2           0.23           0.23
85-              2            4           0.23           0.46
80-              5            9           0.57           1.03
75-              5           14           0.57           1.60
70-              7           21           0.80           2.40
65-             15           36           1.70           4.10
60-             20           56           2.28           6.38
55-             19           75           2.16           8.54
50-             39          114           4.44          12.98
45-             31          145           3.53          16.51
40-             43          188           4.90          21.41
35-             39          227           4.44          25.85
30-             60          287           6.83          31.68
25-             60          347           6.83          38.51
20-             81          428           9.23          47.74
15-             85          513           9.68          48.42
10-            103          616          11.74          60.16
5-             120          736          13.67          73.83
0-             142          878          16.17         100.00
N              878                      100.00
M             23.9





PERCENTILE RANK DISPARITY BETWEEN PAIRED TESTS IN GEOGRAPHY 


The percentile rank disparity between all tests (semester and 
shorter unit) in geography is shown in Table XXIV. 

Groups of pupils in the fifth, sixth, and seventh grades took 
paired tests in geography. Table XXIV shows that the mean 
amount of disparity for all geography tests is 23.3. This mean is 
based upon the results from twenty-seven test groups. 


PERCENTILE RANK DISPARITY BETWEEN PAIRED TESTS IN HISTORY 

A comparison between the disparity found in history with that 
found in geography is of some interest. The facts for history ap- 
pear in Table XXV. 

Paired history tests were administered in the fifth, sixth, and 
seventh grades, and in high school. The mean amount of disparity 
for all history tests is approximately 22 percentile points (Table 
XXV). The lowest mean is 20.0 and the highest is 23.7. The cor- 
responding range for geography is 21.5 to 25.4. The summary 
mean for geography is 23.3. The difference between geography and 
history tests with respect to range and mean of percentile rank dif- 
ferences is so small as to warrant no further comment. 




TABLE XXIII. SUMMARY OF MEANS—PERCENTILE POINT DISPARITY FOR
ALL TESTS WHICH COVER A SEMESTER OF WORK

                            Mean Percentile               Number of
Subject           Grade     Point Disparity    Mean N     Test Groups
Geography         7              25.4           36.2          10
History           7              23.7           35.2          13
History           SHS            22.?           29.0           2
Mean of means                    23.9           33.5





TABLE XXIV. ALL GEOGRAPHY TESTS—SUMMARY OF MEAN PERCENTILE
POINT DISPARITY

                            Mean Percentile               Number of
Subject           Grade     Point Disparity    Mean N     Test Groups
Geography         5              21.5           61.2          10
Geography         6              22.9           64.0           7
Geography         7              25.4           36.2          10
Mean of means                    23.3           53.8





FINDINGS FROM ALL TESTS 


Sixty-eight groups of pupils (3,133 individuals) were given paired 
tests in connection with this study of percentile rank disparity. The 
tests were distributed as follows: geography, twenty-seven pairs; 
history, forty pairs; and health, one pair. A distribution of the re- 
sults from all tests is given in Table XXVI. 

An analysis of the cumulative columns of Table XXVI gives 
the most revealing picture of the findings. Three hundred and nine- 
teen pupils, or about 10 per cent of the group, varied 50 percentile 
points or more. Approximately 10 per cent in each of the distribu- 
tions presented already have consistently varied 50 or more percentile 
points. Five hundred and sixty-two, or 17.9 per cent, showed a 
disparity ranging from 40 percentile points upward. Finally, one 


TABLE XXV. ALL HISTORY TESTS—SUMMARY OF MEAN PERCENTILE
POINT DISPARITY

                            Mean Percentile               Number of
Subject           Grade     Point Disparity    Mean N     Test Groups
History           5              23.0           47.0           8
History           6              21.6           49.1           9
History           7              23.7           35.2          13
History           SHS            20.0           33.9          10
Mean of means                    22.1           41.3







TABLE XXVI. ALL TESTS—SUMMARY DISTRIBUTION OF PERCENTILE
POINT DISPARITY

                                       Percentage     Cumulative
Percentile   Frequency   Cumulative    Distribution   Percentage
Points       All Tests   Frequency     All Tests      Distribution
95-              1            1           0.03           0.03
90-              3            4           0.10           0.13
85-              5            9           0.16           0.29
80-              8           17           0.26           0.55
75-             13           30           0.41           0.96
70-             23           53           0.73           1.69
65-             43           96           1.37           3.06
60-             65          161           2.07           5.13
55-             55          216           1.76           6.89
50-            103          319           3.29          10.18
45-            101          420           3.22          13.40
40-            142          562           4.53          17.93
35-            165          727           5.27          23.20
30-            204          931           6.51          29.71
25-            230         1161           7.34          37.06
20-            301         1462           9.61          46.66
15-            351         1813          11.20          57.86
10-            352         2165          11.24          69.10
5-             489         2654          15.61          84.71
0-             479         3133          15.29         100.00
N             3133                      100.00
M             23.0


should note that 1,161 pupils, distinctly more than one-third (37 
per cent) of the group, showed a minimum disparity of 25 percentile 
points, one-fourth of the possible disparity. 

A summary of the means for all the grade-subject groups is given 
in Table XXVII. 

The summary mean is 23.0. This mean is based upon sixty-eight 
test groups. The most convincing characteristic of Table XXVII is 


TABLE XXVII. ALL TESTS—SUMMARY OF MEANS OF PERCENTILE
POINT DISPARITY

                            Mean Percentile               Number of
Subject           Grade     Point Disparity    Mean N     Test Groups
Geography         5              21.5           61.2          10
Geography         6              22.9           64.0           7
Geography         7              25.4           36.2          10
History           5              23.0           47.0           8
History           6              21.6           49.1           9
History           7              23.7           35.2          13
History           SHS            20.5           33.9          10
Health            6              25.8          100.0           1




the consistency of the means in respect to magnitude. The means 
cluster very closely around the central tendency, 23.0. Such con- 
sistency seems to warrant the conclusion that the amount of dis- 
parity reported in this investigation is probably very near to that 
which one might find generally when objective tests are used in reg- 
ular testing procedures. 


CHAPTER VI 


TEACHER-MARK ANALYSIS OF INFORMAL 
OBJECTIVE TEST RESULTS


In this chapter the data which resulted from administering paired 
informal objective tests are analyzed in terms of teacher-mark¹ disparity.
A teacher mark for a given pupil on a test is simply the
letter grade to which that pupil’s score on the test is equivalent. Each 
pupil who took paired tests had two teacher marks or letter grades. 
To show the extent of disparity or difference between these two sets 
of marks for the subjects and grade levels investigated is the pur- 
pose of this chapter. 

The teacher-mark analysis is subject to two important limita- 
tions. In the first place, the teacher marks were based upon only 
one end of the distribution of scores. For example, the system 
most frequently used by those teachers who assisted with this study 
was based upon percentage figures distributed in the following manner:
90-100, A; 80-89, B; 70-79, C; below 70, F (failing). Thus
a score of 68 on one test and a score of 34 on a second test were
both equivalent to the mark F,² and in the teacher-mark analysis
the two marks were considered as showing no disparity. This is
obviously misleading; the effect is to decrease the actual amount of
disparity. Hence the disparity reported is in reality too small.

In the second place, the same type of criticism applies when both 
tests were unusually easy, and, consequently, the distribution was 
skewed toward the upper end of the scale. Either of the conditions 
mentioned would tend to lower spuriously the manifest teacher- 
mark disparity. 
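
The first limitation can be made concrete with a short sketch. It is an illustration under stated assumptions, not the teachers' actual procedure: it applies the 90/80/70 scheme quoted above to percentage scores and counts the number of letter steps between two marks (the mark interval used in the tables of this chapter). The function names and the list `MARK_ORDER` are hypothetical.

```python
# Letter marks from lowest to highest; only four marks were in use.
MARK_ORDER = ["F", "C", "B", "A"]


def teacher_mark(percentage):
    """Letter mark under the scheme 90-100 A, 80-89 B, 70-79 C, below 70 F."""
    if percentage >= 90:
        return "A"
    if percentage >= 80:
        return "B"
    if percentage >= 70:
        return "C"
    return "F"


def mark_interval(percentage_1, percentage_2):
    """Number of letter steps (0 to 3) separating the marks for two scores."""
    step_1 = MARK_ORDER.index(teacher_mark(percentage_1))
    step_2 = MARK_ORDER.index(teacher_mark(percentage_2))
    return abs(step_1 - step_2)


# The case cited in the text: 68 and 34 are both F, so no disparity is recorded.
print(teacher_mark(68), teacher_mark(34), mark_interval(68, 34))

# By contrast, a single point across a boundary registers a full interval.
print(teacher_mark(69), teacher_mark(70), mark_interval(69, 70))
```

The two cases show why the mark analysis understates disparity in the failing range: a difference of 34 percentage points below 70 leaves no trace, while a difference of one point across a boundary counts as one interval.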


TESTS WHICH COVER IDENTICAL UNITS OF TEXT MATTER 


Fifth grade geography. The teacher-mark disparity between 
paired tests in fifth grade geography shown in Table XXVIII will 
serve as an illustration of the teacher-mark analysis. 

The term “mark interval” which appears as the heading of column 1
in Table XXVIII is used to describe the distance from one


¹ See p. 36 for an example of the original data.

² In many cases the tests used in this study were so difficult (in terms of
errors) that an unusually large number of the scores gave a percentage equivalent
below 70. It should be clear that this fact does not affect the correlation
of raw scores and the percentile rank analyses.




TABLE XXVIII. FIFTH GRADE GEOGRAPHY—DISPARITY IN TERMS OF TEACHERS’
MARKS BETWEEN INFORMAL OBJECTIVE TESTS WHICH
COVER IDENTICAL TEXT MATERIAL

                                       Percentage     Cumulative
Mark         Frequency   Cumulative    Distribution   Percentage
Interval     All Tests   Frequency     All Tests      Distribution
3               58           58           9.5            9.5
2              109          167          17.8           27.3
1              213          380          34.8           62.1
0              232          612          37.9          100.0
N              612                      100.0


letter mark to another in the scale. A is one mark interval from B, 
two mark intervals from C, and three mark intervals from F. (Only 
four marks were used.) All identical marks are said to show zero 
mark interval disparity. The greatest possible mark interval dif- 
ference between the teacher marks from two tests is three. 

Table XXVIII indicates that fifty-eight, or 9.5 per cent, of the 
pupils who took paired tests in fifth grade geography showed a 
disparity of three mark intervals—that is, these pupils made F’s on 
one test and A’s on another. One hundred and nine pupils, or 17.8
per cent of the group, showed a disparity of two mark intervals. An 
examination of the cumulative columns is illuminating. One hun- 
dred and sixty-seven, or 27 per cent, received marks two or more 
intervals apart. Further, 380 pupils, 62 per cent of the group, re- 
ceived marks which varied one or more mark intervals. The mean 
mark-interval disparity for fifth grade geography is .98 mark inter- 
vals or approximately one. 
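
The mean mark-interval disparity can be approximated from the frequency column of Table XXVIII. The sketch below is an illustration only; the dictionary name and the method (a mean taken directly from the pooled distribution) are assumptions, and the study may instead have averaged the means of the ten test groups.

```python
# Frequency of each mark-interval disparity in Table XXVIII.
distribution = {3: 58, 2: 109, 1: 213, 0: 232}

n_pupils = sum(distribution.values())

# Mean disparity: each interval weighted by the number of pupils showing it.
mean_disparity = sum(k * f for k, f in distribution.items()) / n_pupils

# Share of pupils whose two marks differ by at least one interval.
share_one_or_more = sum(f for k, f in distribution.items() if k >= 1) / n_pupils

print(f"N = {n_pupils}")
print(f"mean mark-interval disparity = {mean_disparity:.2f}")
print(f"per cent one or more intervals apart = {100 * share_one_or_more:.1f}")
```

The pooled mean comes out within about one hundredth of the .98 reported, and the share of pupils one or more intervals apart agrees with the 62 per cent given in the text.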

Other identical unit tests. Facts³ similar in nature to those just
presented were secured for sixth grade geography, fifth and sixth
grade history, high-school history, and sixth grade health. The consistency
of the percentages of the pupils who varied one or more
mark intervals is of interest. It will be recalled that this percentage
indicates the per cent of pupils whose marks varied one or more
mark intervals. Sixty-two per cent of the fifth grade geography
pupils varied to this extent. The corresponding figures for the remaining
subject-grade groups are as follows: sixth grade geography,
62.5; fifth grade history, 72.9; sixth grade history, 65.4; high-school
history, 50.8; sixth grade health, 82.0.

³ Complete data for each subject is not presented due to space limitations.
Anyone desiring such facts should correspond with the author.

Summary for all identical text matter tests. Teacher marks were
secured on paired objective tests for forty test groups (2,134 pupils).
Summaries of the data found are shown in Tables XXIX and XXX.
Table XXIX is a summary distribution of mark-interval disparity
based upon the forty test-group distributions.


TABLE XXIX. SUMMARY DISTRIBUTION—TEACHER-MARK DISPARITY FOR ALL
TESTS WHICH COVER IDENTICAL TEXT MATERIAL

                                       Percentage     Cumulative
Mark         Frequency   Cumulative    Distribution   Percentage
Interval     All Tests   Frequency     All Tests      Distribution
3              136          136           6.4            6.4
2              446          582          20.9           27.3
1              796         1378          35.4           62.7
0              756         2134          37.3          100.0
N             2134                      100.0
M             0.96





This table reveals that 136, or 6 per cent of 2,134 individuals, 
varied three mark intervals on paired tests. Four hundred and forty-six,
or approximately one-fifth of the group, showed a teacher-mark
disparity of two mark intervals. Thirty-five per cent of the indi- 
viduals received marks one mark interval apart. The cumulative 
percentage column shows that somewhat more than one-fourth of 
the pupils were given marks which differed two or more mark in- 
tervals. Further, almost two-thirds of the group (62.7 per cent) 
showed a minimum teacher-mark disparity of one or more mark in- 
tervals. 

A summary in terms of mean mark disparity for subject-grade 
groups is presented in Table XXX. 


TasLe XXX. SumMary or MEANS—TEACHER-Mark Disparity ror ALL 
Tests WuicH Cover Ipentica, TEXT MATERIAL 











Mean Teacher- Number of 
Subject Grade Mark Disparity Mean N Test Groups 
Geographyin.ea ences. 5 0.98 61.2 10 
Geography tie. cenceaeeee 6 0.87 64.0 7 
history.enas cis ier eee 5 1.03 47.0 8 
FListory, wa. ick uch oh ae ie 6 1.02 Sita, 8 
History. anise ea nee SHS 0.76 32.8 6 
Healtts-.necnmacueaa 6 1.03 100.0 1 
Mean‘of means’. ccocckck, ck Sete eee 0.96 59.4 





The most significant feature of the data in Table XXX is the
consistency in size of the means. The lowest mean is .76 and the
highest is 1.08. If the somewhat special case⁴ of high-school history
is disregarded for the moment, the lowest mean becomes 0.87. The
summary mean is 0.96 or approximately 1.0. One should recall
here that because of the limitations of this technique 3.0 is the greatest
possible disparity and that, therefore, the mean amount of disparity
found, one interval, is one-third of the total amount possible.


TESTS WHICH COVER A SEMESTER OF WORK 


Seventh grade geography, seventh grade history, and high-school 
history. Teacher marks were computed for groups of pupils in 
seventh grade geography, in seventh grade history and in high- 
school history. The teacher-mark disparity for each of these sub- 
jects was determined. The distributions are very similar in nature 
to those already presented, except that the amount of disparity tends 
to be somewhat greater for semester tests. Again attention is called 
to the marked consistency in the shape of the distributions. In sev- 
enth grade geography 72 per cent varied one or more mark inter- 
vals, in seventh grade history 69 per cent, and in high-school history 
70 per cent. 

Summary for all semester tests. Twenty-five groups of pupils 
(878 individuals) took paired semester tests. Summaries of the 
teacher-mark disparity revealed are given in Tables XXXI and 
XXXII. 


TABLE XXXI. SUMMARY DISTRIBUTION—TEACHER-MARK DISPARITY FOR ALL
TESTS WHICH COVER A SEMESTER OF WORK

                                       Percentage     Cumulative
Mark         Frequency   Cumulative    Distribution   Percentage
Interval     All Tests   Frequency     All Tests      Distribution
3               97           97          11.0           11.0
2              241          338          27.4           38.4
1              283          621          32.3           70.7
0              257          878          29.3          100.0
N              878                      100.0
M             1.28


A comparison of Table XXXI with the corresponding table
(Table XXIX) for identical unit tests indicates that on the whole
the disparity is somewhat greater in the case of semester tests, but
the difference is hardly pronounced enough to establish a trend.
Eleven per cent of the pupils varied three mark intervals on semester
tests, as compared with 6.4 per cent on the shorter identical unit
tests. Thirty-eight per cent showed a disparity of two or more intervals
on semester tests, whereas the corresponding figure for
identical unit tests was 27. Finally, 70.7 per cent of the children
varied a minimum of one mark interval on semester tests and 62
per cent varied the same amount on the other type of test.

⁴ See pp. 40-41 for further discussion. Also, the fact that an unusually
large number of pupils made below 70 on both pairs of tests tends to explain
this figure.

A summary of the mean teacher-mark disparity for semester tests 
is shown in Table XXXII. 


TABLE XXXII. SUMMARY OF MEANS: TEACHER-MARK DISPARITY OF ALL
TESTS WHICH COVER A SEMESTER OF WORK

                         Mean Teacher-                 Number of
Subject         Grade    Mark Disparity    Mean N      Test Groups
Geography         7          1.15           36.2           10
History           7          1.20           35.2           13
History          SHS         1.49           29.0            2
Mean of means                1.28           33.5


The summary mean is 1.28 mark intervals. This figure may be, 
to a small degree, spuriously high due to the fact that the mean
for the high-school tests is based upon a relatively small number of 
cases and, therefore, may not be representative. However, one 
should recall that the semester high-school tests were the longest and 
perhaps the most carefully constructed tests used in the study. Nev- 
ertheless, the facts at hand show that teacher marks on paired ob- 
jective tests varied about one and one-fourth mark intervals when 
semester tests were used. 
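
The caution about the high-school mean can be made concrete. The following is a minimal sketch, assuming only the figures printed in Tables XXXI and XXXII; the variable names are illustrative and are not part of the study. It compares the unweighted mean of the three subject means with two weighted alternatives.

```python
# Figures as printed in Table XXXII:
# (subject, grade, mean teacher-mark disparity, number of test groups).
rows = [
    ("Geography", "7", 1.15, 10),
    ("History", "7", 1.20, 13),
    ("History", "SHS", 1.49, 2),
]

# Unweighted mean of the three subject means (the summary figure in the text).
mean_of_means = sum(m for _, _, m, _ in rows) / len(rows)

# Mean weighted by the number of test groups behind each subject mean.
group_weighted = sum(m * g for _, _, m, g in rows) / sum(g for _, _, _, g in rows)

# Mean taken over pupils, from the frequencies printed in Table XXXI
# (mark intervals of disparity -> number of pupils).
frequencies = {3: 97, 2: 241, 1: 283, 0: 257}
pupil_mean = sum(k * f for k, f in frequencies.items()) / sum(frequencies.values())

print(round(mean_of_means, 2), round(group_weighted, 2), round(pupil_mean, 2))
```

On these figures the unweighted mean is 1.28, while either weighting gives about 1.20, which agrees with the remark that the two high-school groups raise the summary mean to a small degree.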


GEOGRAPHY AND HISTORY TESTS 


All geography tests. The teacher-mark disparity for all geog- 
raphy tests is shown in Table XXXIII. 


TABLE XXXIII. ALL GEOGRAPHY TESTS: SUMMARY DISTRIBUTION OF
TEACHER-MARK DISPARITY

                                        Percentage      Cumulative
Mark        Frequency,   Cumulative     Distribution    Percentage
Interval    All Tests    Frequency      All Tests       Distribution
3               90           90             6.3             6.3
2              296          386            20.8            27.1
1              536          922            37.7            64.8
0              500         1422            35.2           100.0
N             1422                        100.0




Note the percentage column. Approximately 6 per cent of the 
pupils made F on one test and A on another; about one-fifth (20.8
per cent) showed a disparity of two marks; and 37 per cent varied 
one mark interval. Almost two-thirds of the group received marks 
one or more mark intervals apart. The mean for geography tests is 
1.0 mark interval. 

All history tests. Similar facts for all history tests are given in 
Table XXXIV. The distributions for history and geography are very 
similar. For example, in geography 64.8 per cent of the pupils va- 
ried a minimum of one mark interval; the corresponding figure for 
history is 66.8. The mean for history tests is somewhat higher, due 
to a slightly greater proportion of cases in the two- and three-interval 
arrays. 


TABLE XXXIV. ALL HISTORY TESTS: SUMMARY DISTRIBUTION OF
TEACHER-MARK DISPARITY

                                        Percentage      Cumulative
Mark        Frequency,   Cumulative     Distribution    Percentage
Interval    All Tests    Frequency      All Tests       Distribution
3              135          135             9.1             9.1
2              359          494            24.1            33.2
1              501          995            33.6            66.8
0              495         1490            33.2           100.0
N             1490                        100.0





SUMMARY FOR ALL TESTS 


The summary distribution given in Table XXXV includes all of
the pupils for whom teacher marks were available. The data are 
based upon 3,012 cases (sixty-five test groups). 


TABLE XXXV. ALL TESTS: SUMMARY DISTRIBUTION OF TEACHER-MARK
DISPARITY

                                        Percentage      Cumulative
Mark        Frequency,   Cumulative     Distribution    Percentage
Interval    All Tests    Frequency      All Tests       Distribution
3              233          233             7.7             7.7
2              687          920            22.8            30.5
1             1079         1999            35.8            66.3
0             1013         3012            33.7           100.0
N             3012                        100.0





Table XXXV shows that 233 pupils, or 7.7 per cent, varied from 
one test to another to the extent of three mark intervals. About
one-fifth of the pupils showed a disparity of two mark intervals;
approximately one-third of the pupils one mark interval; and one- 
third received identical marks. The cumulative percentage column 
in Table XXXV shows the facts very clearly. Almost one-third 
(30.5 per cent) of the pupils varied two or more mark intervals, and 
two-thirds of the group (66.3 per cent) varied a minimum of one 
mark interval. 
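
The percentage and cumulative percentage columns follow directly from the frequencies. The sketch below, assuming the frequencies printed in Table XXXV, recomputes both columns; the function name is illustrative only. The recomputed values agree with the printed ones to within a tenth of one per cent. The small differences may arise because the printed cumulative column appears to be summed from already rounded percentages, though that is an assumption.

```python
# Frequencies as printed in Table XXXV:
# mark intervals of disparity -> number of pupils.
frequencies = {3: 233, 2: 687, 1: 1079, 0: 1013}
total = sum(frequencies.values())


def distribution(freqs):
    """Return (interval, per cent, cumulative per cent), largest disparity first."""
    n = sum(freqs.values())
    rows, running = [], 0
    for interval in sorted(freqs, reverse=True):
        running += freqs[interval]
        rows.append((interval, 100 * freqs[interval] / n, 100 * running / n))
    return rows


table = distribution(frequencies)
for interval, pct, cum_pct in table:
    print(f"{interval}  {pct:5.1f}  {cum_pct:5.1f}")
```

Read from the top, the cumulative column gives the share of pupils whose two marks differed by at least that many intervals.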


A summary of the mean teacher-mark disparity found appears in 
Table XXXVI. 


TABLE XXXVI. ALL TESTS: SUMMARY OF MEANS OF TEACHER-MARK
DISPARITY

                         Mean Teacher-                 Number of
Subject         Grade    Mark Disparity    Mean N      Test Groups
Geography         5          0.98           61.2           10
Geography         6          0.87           64.0            7
Geography         7          1.15           36.2           10
History           5          1.08           47.0            8
History           6          1.02           51.2            8
History           7          1.20           35.2           13
History          SHS         1.12           30.9            8
Health            6          1.03          100.0            1
Mean of means                1.06           53.2





Table XXXVI gives the mean teacher-mark disparity for subject- 
grade groups. The highest of the means is 1.20 and the lowest is 
0.87, a range of 0.33. As a whole, however, these means center very 
closely about the central tendency, 1.06. Note that six of the means 
fall between 0.98 and 1.15. 
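
The summary figures quoted above can be checked from the eight subject-grade means. This is a minimal sketch, assuming the means printed in Table XXXVI; the Geography 7 entry is taken as 1.15, the value given for the same groups in Table XXXII, which is an assumption where the printed entry is unclear.

```python
# Mean teacher-mark disparity by subject and grade, as printed in Table XXXVI.
# The Geography 7 value (1.15) is assumed from Table XXXII.
means = [
    ("Geography", "5", 0.98),
    ("Geography", "6", 0.87),
    ("Geography", "7", 1.15),
    ("History", "5", 1.08),
    ("History", "6", 1.02),
    ("History", "7", 1.20),
    ("History", "SHS", 1.12),
    ("Health", "6", 1.03),
]

values = [m for _, _, m in means]
mean_of_means = sum(values) / len(values)          # central tendency
spread = max(values) - min(values)                 # highest minus lowest mean
near_center = sum(0.98 <= v <= 1.15 for v in values)  # means within 0.98-1.15

print(round(mean_of_means, 2), round(spread, 2), near_center)
```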

Concluding statement. The facts presented in Tables XXXV 
and XXXVI warrant the following conclusions: when teacher marks 
are assigned to pupils on the basis of two objective tests constructed 
by competent teachers to measure knowledge of the same subject mat-
ter, the two marks will vary an average of approximately one mark
interval. 


(b.) COMMERCIAL STANDARDIZED TESTS 


CHAPTER VII


DESCRIPTION OF PROCEDURES USED IN STUDY 
OF STANDARDIZED TESTS 


The purpose of this part of the investigation was the same as 
that of the teacher-made test study, namely, to determine the extent 
of disparity between tests which were designed to measure the 
same abilities. In this chapter the procedures which pertain to the 
study of standardized commercial tests are described. 

Selection of tests. In order to ascertain the disparity between 
comparable commercial tests, it was necessary to administer two or 
more tests to the same children. Although the results are analyzed 
in terms of specific subject tests and not in terms of batteries, in 
order to simplify administration the sub-tests of certain batteries 
were chosen for investigation. 

Three of the better known and more complete batteries were 
chosen for study, namely, The New Stanford, Advanced Examina- 
tion, Form W; The Metropolitan Achievement Tests, Intermediate 
Battery—Complete, Form A; and The Public School Achievement 
Test, Batteries A, B, and C, Form 3. In order to facilitate descrip- 
tion the three batteries are given the following designation: Metro- 
politan, a; Public School, b; New Stanford, c. Thus, “Reading a” 
is the reading test from the Metropolitan battery. 

Table XXXVII shows the number and type of sub-tests in each 
battery. 

As shown in Table XXXVII, three comparisons were possible in
each of seven subjects, and one comparison could be made in each 
of the remaining three subjects. There was a possible total then of 
twenty-four subject-test comparisons. 
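
The count of twenty-four comparisons can be reproduced by pairing the batteries that contain a sub-test in each subject. The sketch below is illustrative only. The assignment of the three single-comparison subjects to particular batteries is an assumption, read off the pairings reported later in Table XL.

```python
from itertools import combinations

# Batteries (a Metropolitan, b Public School, c New Stanford) that contain a
# sub-test in each subject. The entries for Vocabulary, Literature, and Health
# are assumptions inferred from the pairs correlated in Table XL.
subtests = {
    "Reading": "abc",
    "Vocabulary": "ac",
    "Arithmetic reasoning": "abc",
    "Arithmetic computation": "abc",
    "English": "abc",
    "Literature": "ac",
    "History": "abc",
    "Geography": "abc",
    "Spelling": "abc",
    "Health": "bc",
}

# Every unordered pair of batteries sharing a subject is one comparison.
comparisons = {s: list(combinations(b, 2)) for s, b in subtests.items()}
total = sum(len(pairs) for pairs in comparisons.values())

# Number of sub-tests contributed by each battery.
per_battery = {k: sum(k in b for b in subtests.values()) for k in "abc"}

print(total, per_battery)
```

Seven subjects with three batteries give three pairs each, and three subjects with two batteries give one pair each, for 21 + 3 = 24.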

Administration of tests. The tests were administered in the sixth 
grades of seven public schools in Durham, North Carolina. Four 
hundred and sixty pupils took the three tests. 

The three batteries were given as a final examination on a semes-
ter of school work. They were administered during the regular ex-
amination period, and were considered by the pupils as a measure of
their progress. This condition provided a relatively strong motiva-
tion. Three days were set aside by the school system during which
the sixth grade children were freed from all other routine school
activities. The pupils were permitted to go home at the end of
the testing day. The children manifested a keen interest in the tests
throughout the testing period. In fact, they seemed, in the main,
to enjoy the taking of these tests, and in many cases overtly ex-
pressed regret that it was necessary to return to the school routine.


TABLE XXXVII. SUB-TEST COMPOSITION OF METROPOLITAN, NEW STANFORD,
AND PUBLIC SCHOOL ACHIEVEMENT TESTS

                                    Battery           Number of
                                                      Comparisons
Subject                          a     b     c        Possible
Reading                          *     *     *            3
Vocabulary                       *           *            1
Arithmetic: Reasoning            *     *     *            3
            Computation          *     *     *            3
English usage                    *     *     *            3
Literature                       *           *            1
History                          *     *     *            3
Geography                        *     *     *            3
Spelling                         *     *     *            3
Health                                 *     *            1
Total                            9     8    10           24

*Indicates that there is a sub-test.

In order to minimize fatigue and practice effects, three admin- 
istrative precautions were taken. First, the schools were paired ac- 
cording to the general type of children in attendance. A school hav- 
ing a large number of slow or retarded children in attendance was 
paired with a school having a relatively large number of “bright” 
or accelerated pupils to make an “administrative group.” This 
guaranteed that each administrative group would consist of repre- 
sentative children. In the discussion which follows the administrative 
groups have the following designation: Group I (school 1 and school 
7); Group II (school 5, school 2, and school 3); and Group III 
(school 4 and school 6). 

As a second precaution, a method of rotation was used in ad- 
ministering the tests. The pertinent facts concerning this method 
are presented in Table XXXVIII.

Hence, if any position on the program was favorable the effect 
should have been neutralized when the three groups were considered 
as a whole. 
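
The rotation in Table XXXVIII has the form usually called a Latin square: each group takes each battery once, and on each day each battery is taken by exactly one group. The sketch below checks that property for the printed schedule; the term "Latin square" and the variable names are additions for illustration and do not appear in the study.

```python
# Order of administration as printed in Table XXXVIII
# (January 14, January 15, January 16).
schedule = {
    "Group I": ["Metropolitan", "New Stanford", "Public School"],
    "Group II": ["New Stanford", "Public School", "Metropolitan"],
    "Group III": ["Public School", "Metropolitan", "New Stanford"],
}
batteries = set(schedule["Group I"])

# Every group takes every battery exactly once.
rows_ok = all(
    len(order) == len(batteries) and set(order) == batteries
    for order in schedule.values()
)

# On every day, each battery is taken by exactly one group.
days = list(zip(*schedule.values()))
columns_ok = all(set(day) == batteries for day in days)

is_latin_square = rows_ok and columns_ok
print(is_latin_square)
```

Because every battery occupies every position once, any advantage attached to a particular day is spread evenly over the three batteries when the groups are pooled.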

TABLE XXXVIII. METHOD OF ROTATION USED IN ADMINISTERING
STANDARDIZED TESTS

Administrative                Date Tests Were Administered
Group               January 14       January 15       January 16
Group I             Metropolitan     New Stanford     Public School
Group II            New Stanford     Public School    Metropolitan
Group III           Public School    Metropolitan     New Stanford


The third precaution pertained to sequence of sub-tests. As has
been stated, the sub-tests were considered as separate subject¹ tests.
Since the results from these tests were to be compared, it was nec- 
essary that the individual sub-tests in a given subject field be given 
in a manner as nearly as possible identical. In order to effect this 
condition, the sub-tests in each battery were given in the same se- 
quence, and consequently the tests to be compared came at approx- 
imately the same time of day. For example, in every case the 
Arithmetic Computation test was given as the third test, appearing 
at the beginning of the second sitting. Thus, if there was any ad- 
vantage or disadvantage in this particular place in the day’s pro- 
gram, the effect should have been the same for each of the three 
tests to be compared.²

The children were given brief rest periods between sub-tests and 
were allowed a play period in the open air between the four major 
sittings. 

The tests were administered by persons who were trained and 
experienced in the giving of standardized tests. Special emphasis 
was placed upon an exact adherence to the manual of instructions. 
In order to insure proper administration, a special conference was 
held with each person who assisted with the administration of the 
tests. In this conference the purpose of the study was carefully ex- 
plained, and the manuals and tests were examined in detail. The 
point that the same person administered all the tests to a given
group of pupils is worthy of note, for this offset any advantage or
disadvantage that might have accrued from the personality of a
particular individual tester.

¹ Many of the sub-tests of the batteries used in this investigation are sold as
separate tests.

² One should keep in mind that the purpose of this study was to compare
the results from standardized commercial tests. Whether or not the conditions
were most favorable for excellence of performance is not of fundamental sig-
nificance for this study provided the conditions were the same for each of the
tests to be compared. The criticism may be made that the testing program
was too strenuous, but neither facts nor observation seems to support this con-
tention. For example, the facts seem to indicate that the relationships be-
tween the scores on the sub-tests of different batteries were relatively un-
changed regardless of the sequence in which the batteries were given. How-
ever, this point does not affect the validity, for the purposes of this study, of
the rotation method used.

Scoring the tests. The usual precautions were taken to guarantee 
accuracy in the scoring of the tests. The factors given special con- 
sideration are listed here. (1) A central scoring place was used in 
order to insure complete uniformity. (2) All persons who assisted 
with the scoring had experience in scoring standardized tests. (3) 
The whole scoring procedure was carried on under the constant 
supervision of the investigator. (4) All scoring was done in strict 
accordance with keys and manuals. (5) In order that groups of 
papers revealing material errors might be re-scored, samples of 
each group³ of papers were carefully examined.

All raw scores were transmuted into grade-equivalent norms which 
are furnished by the publishers of the tests. In every case this trans- 
mutation was performed in strict accordance with manual directions. 
These grade equivalents are calculated in terms of one-tenths of a 
school year. Thus a grade equivalent of 6.4, although often in- 
terpreted as representing achievement expected at the end of the 
fourth month of the sixth grade, may be more technically considered 
as representing six and four-tenths years of school work. The pro- 
cedures relative to the calculation of grade equivalents are uniform 
for all the tests. 

Nature and treatment of results. When the grade equivalents 
for any given tests (for example, the three reading tests) had been 
determined, the three measures of each child’s ability were expressed 
in comparable terms. A given child, 1, had grade equivalents as fol-
lows in reading:

Ra 6.0
Rb 8.5
Rc 7.4


The problem of Part B of the study was to determine the extent 
of variation or disparity between the performance of pupils on com- 
parable standardized commercial tests. Two methods of analysis 
were used to show the extent of disparity. First, the scores from 
paired tests were correlated (Chapter VIII); second, the difference 
or disparity in test results was calculated in terms of months or 
one-tenths of a school year (Chapter IX). 


³ Those papers scored by a given scorer, or those sub-tests scored by a given
individual.


CHAPTER VIII 


CORRELATION ANALYSIS OF STANDARDIZED
COMMERCIAL TEST RESULTS


A correlation coefficient may be interpreted in several ways, but 
perhaps the simplest and most meaningful interpretation for the 
present purpose is that which states that the square of the coefficient 
indicates the percentage of factors common to the correlated va-
riables.¹ Obviously, a correlation of 1.00 would mean that 100 per
cent of the factors producing the scores were common. By the
same method, it is clear that a coefficient of .50 (.25 when squared)
would indicate the presence of 25 per cent of overlapping or identical
factors.
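
The interpretation adopted here is simple arithmetic, and a short sketch may help in reading the tables that follow. The function names below are illustrative. The second function merely inverts the first to show what coefficient corresponds to a given share of common factors under the same interpretation.

```python
import math


def common_factor_share(r):
    """Share of common factors under the r-squared interpretation used in the text."""
    return r * r


def r_for_share(share):
    """Coefficient that corresponds to a given share of common factors."""
    return math.sqrt(share)


for r in (1.00, 0.68, 0.50):
    print(f"r = {r:.2f}: {100 * common_factor_share(r):.0f} per cent common factors")

print(f"r for 50 per cent common factors: {r_for_share(0.50):.3f}")
```

On this reading a coefficient must be slightly above .70 before half of the factors can be called common.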

A correlation table will further clarify the meaning of the co-
efficients.² Table XXXIX shows the distribution of cases when r
is .68. 

The cases which appear in a given interval on one test are dis- 
tributed widely on the other test. As an example, note the distribu- 
tion of the sixty cases in the interval 6.0-6.3 on Test a (Table 
XXXIX). These cases are distributed on Test b in intervals rang- 
ing from 4.8-5.1 to 9.6-9.9. Only eight of the sixty cases appear in 
the 6.0-6.3 interval on Test b. Similar analysis of other arrays 
enables one to secure a clearer understanding of the individual va- 
riation involved in a correlation of .68. 

Correlations for all subject tests. If two standardized commer- 
cial tests constructed by experts to measure the same ability are given 
to the same children under comparable conditions, what will be the 
relationship between the two sets of scores secured? The answer 
to this question for the sub-tests of the three standardized tests is 
given in Table XL. 

One should note that the subject tests are arranged alphabetically
in Table XL. The median of the coefficients is .68. The correspond-
ing figure for teacher-made tests is .54. Thus, although standardized
commercial tests show somewhat less disparity than teacher-made
objective tests, it is clear that a large degree of variability remains
in the reputedly more refined standardized tests.

¹ This is a widely used interpretation. See Henry E. Garrett, Statistics in
Psychology and Education (New York: Longmans, Green, and Company,
1926), pp. 291-298.

² For a valuable discussion of this type of analysis see Frank Sandon, “The
Necessary Imperfections of An Examination,” The British Journal of Educa-
tional Psychology, V (June, 1935), 191-192.


TABLE XXXIX. CORRELATION TABLE: READING TESTS a AND b

[Scatter table of grade-equivalent scores. Rows: Test a, in intervals from
4.4-4.7 to 10.0-10.3. Columns: Test b, in intervals from 4.0-4.3 to 9.6-9.9.
The individual cell frequencies are not legible in this copy.]

Column totals (F), Test b:

4.0-4.3    6        6.0-6.3   44        8.0-8.3   58
4.4-4.7    5        6.4-6.7   40        8.4-8.7   41
4.8-5.1   10        6.8-7.1   32        8.8-9.1   14
5.2-5.5   39        7.2-7.5   78        9.2-9.5   17
5.6-5.9   26        7.6-7.9   20        9.6-9.9   30

N  460
r  .675

As indicated by Table XL, the geography tests b and c show the 
highest correlation. The lowest coefficient is that for health tests 
(.496). The relatively high correlation between Tests b and c in 
geography is misleading in a sense, for its unusual size was due to 
a relatively small number of cases (about 25) who made the highest
possible score on both tests. If these cases are omitted, the correla- 
tion becomes approximately the same as that found for the other 
geography tests, namely, about .65. 

The column entitled r² in Table XL may be considered as one
index to the disparity in results from the tests involved. If the 
correlation is .70 or below, less than 50 per cent of the factors which 
produced the scores may be said to be identical. Fourteen of the 
twenty-four correlations come in this group. 
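
The count of fourteen can be checked against the coefficients in Table XL. The sketch below assumes the coefficients as printed there. Some subject labels are unclear in the printed table and are filled in here by position, which is an assumption. Counting the coefficients whose unrounded square falls below .50 gives fourteen; the count includes the reading coefficient of .705, whose square (.497) appears as .50 when rounded to two places.

```python
# (subject, tests correlated, r) as printed in Table XL.
# Subject labels for a few rows are assumed from their position in the table.
coefficients = [
    ("Arithmetic computation", "a-b", 0.749),
    ("Arithmetic computation", "b-c", 0.696),
    ("Arithmetic computation", "a-c", 0.708),
    ("Arithmetic reasoning", "a-b", 0.708),
    ("Arithmetic reasoning", "b-c", 0.713),
    ("Arithmetic reasoning", "a-c", 0.635),
    ("Geography", "a-b", 0.664),
    ("Geography", "b-c", 0.906),
    ("Geography", "a-c", 0.576),
    ("Health", "b-c", 0.496),
    ("History", "a-b", 0.590),
    ("History", "b-c", 0.643),
    ("History", "a-c", 0.565),
    ("Language usage", "a-b", 0.647),
    ("Language usage", "b-c", 0.649),
    ("Language usage", "a-c", 0.645),
    ("Literature", "a-c", 0.588),
    ("Reading", "a-b", 0.675),
    ("Reading", "b-c", 0.705),
    ("Reading", "a-c", 0.730),
    ("Word meaning", "a-c", 0.790),
    ("Spelling", "a-b", 0.863),
    ("Spelling", "b-c", 0.848),
    ("Spelling", "a-c", 0.844),
]

# Pairs for which fewer than half of the factors are common (r squared < .50).
below_half = [row for row in coefficients if row[2] ** 2 < 0.50]

print(len(below_half), "of", len(coefficients))
```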

TABLE XL. CORRELATIONS BETWEEN STANDARDIZED ACHIEVEMENT TESTS WHICH
WERE DESIGNED TO MEASURE THE SAME ABILITIES

N 460

                             Tests                   r² (Per Cent of
Subject                      Correlated*     r       Common Factors)
Arithmetic computation       a and b       .749           .56
Arithmetic computation       b and c       .696           .48
Arithmetic computation       a and c       .708           .50
Arithmetic reasoning         a and b       .708           .50
Arithmetic reasoning         b and c       .713           .51
Arithmetic reasoning         a and c       .635           .40
Geography                    a and b       .664           .44
Geography                    b and c       .906           .82
Geography                    a and c       .576           .33
Health                       b and c       .496           .25
History                      a and b       .590           .35
History                      b and c       .643           .41
History                      a and c       .565           .32
Language usage               a and b       .647           .42
Language usage               b and c       .649           .42
Language usage               a and c       .645           .42
Literature                   a and c       .588           .34
Reading                      a and b       .675           .46
Reading                      b and c       .705           .50
Reading                      a and c       .730           .53
Word meaning                 a and c       .790           .62
Spelling                     a and b       .863           .74
Spelling                     b and c       .848           .72
Spelling                     a and c       .844           .71
Median                                     .68

*Test a Metropolitan, b Public School, c New Stanford.


Closely related studies. About the time the investigation just
described was completed, Foran and Loyes³ reported a very similar
study. The tests used were: (1) The New Stanford, Advanced
Examination, Form V; (2) The Modern School Achievement, Test
I; and (3) The Unit Attainment Scale, Form A, Division 2. The
results from the three tests were correlated. The coefficients found
are presented in Table XLI.

³ T. G. Foran and Sister M. Edmund Loyes, “The Relative Difficulty of
Three Achievement Examinations,” The Journal of Educational Psychology,
XXVI (March, 1935), 218-222.

The median of the twenty-three coefficients in Table XLI is .60.
The median coefficient for the Durham study was .68. Foran’s 
and Loyes’s facts further illuminate the problem of this investiga- 
tion. If the data from the two studies are combined, some tentative 
generalizations concerning the relationship between standardized 
tests in certain subjects at the grammar grade level may be ventured. 
The mean correlation for the various subject tests (based upon co- 
efficients from both studies) is shown in Table XLII. 

TABLE XLI. CORRELATIONS FROM FORAN’S AND LOYES’S STUDY OF THREE
STANDARDIZED TESTS*

Subject                          Tests Correlated**       r
Arithmetic computation           Unit and M.S.          .564
                                 Unit and N.S.          .605
                                 M.S. and N.S.          .709
Arithmetic reasoning             Unit and M.S.          .603
                                 Unit and N.S.          .592
                                 M.S. and N.S.          .611
Geography                        Unit and M.S.          .661
                                 Unit and N.S.          .678
                                 M.S. and N.S.          .898
Health                           M.S. and N.S.          .440
History                          Unit and M.S.          .514
                                 Unit and N.S.          [illegible]
                                 M.S. and N.S.          .545
Language usage                   Unit and M.S.          .428
                                 Unit and N.S.          .239
                                 M.S. and N.S.          .455
Literature                       Unit and N.S.          .412
Reading: Paragraph Meaning       Unit and M.S.          .543
                                 Unit and N.S.          .636
                                 M.S. and N.S.          [illegible]
Spelling                         Unit and M.S.          .670
                                 Unit and N.S.          .685
                                 M.S. and N.S.          .745
Median                                                  .603

*Data adapted from T. G. Foran and Sister M. Edmund Loyes, op. cit.
**Unit: Unit Attainment Scale, Form A, Division 2.
M.S.: Modern School Achievement, Test I.
N.S.: New Stanford, Advanced Examination, Form V.


The mean for all subject tests is .633. The median of the forty-
seven coefficients is .647. Note the mean for specific subjects as
shown in Table XLII. Arithmetic computation, for example, shows
a mean correlation of .672, based upon the correlation of six pairs
of standardized tests designed to measure the same ability. The 
mean for other subjects may be read from Table XLII. 
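
The subject means in Table XLII are simple averages of the coefficients pooled from the two studies. The sketch below illustrates the pooling for three subjects whose coefficients are all legible in Tables XL and XLI; the data structure and names are illustrative only.

```python
# Coefficients pooled from Table XL (present study, listed first) and
# Table XLI (Foran and Loyes) for three subjects.
pooled = {
    "Arithmetic computation": [0.749, 0.696, 0.708, 0.564, 0.605, 0.709],
    "Health": [0.496, 0.440],
    "Spelling": [0.863, 0.848, 0.844, 0.670, 0.685, 0.745],
}

# For each subject: mean r, lowest r, highest r, number of coefficients.
summary = {
    subject: (sum(rs) / len(rs), min(rs), max(rs), len(rs))
    for subject, rs in pooled.items()
}

for subject, (mean_r, low, high, n) in summary.items():
    print(f"{subject:24s} mean {mean_r:.3f}  range {low:.3f} to {high:.3f}  n {n}")
```

For arithmetic computation the six coefficients average .672, the figure cited in the text.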

TABLE XLII. MEAN CORRELATIONS FOR STANDARDIZED TESTS IN VARIOUS
SUBJECTS BASED UPON DATA FROM FORAN AND LOYES AND
UPON DATA FROM THE PRESENT STUDY

                                                          Number r’s in
Subject                      Mean r      Range of r’s     Subject Group
Arithmetic computation        .672       .564-.749              6
Arithmetic reasoning          .644       .592-.713              6
Geography                     .730       .576-.906              6
Health                        .468       .440-.496              2
History                       .571       .514-.643              6
Language usage                .510       .239-.649              6
Literature                    .500       .412-.588              2
Reading                       .669       .543-.730              6
Spelling                      .776       .670-.863              6
Word meaning                  .790                              1
N                                                              47
Mean of means                 .633


Ruch and others⁴ correlated the scores on ten American History
tests. The mean correlation between a given history test and nine
other history tests (for each of the ten tests studied) is shown in
Table XLIII.


TABLE XLIII. MEAN CORRELATIONS BETWEEN AMERICAN HISTORY TESTS
(STANDARDIZED) BASED UPON DATA FROM RUCH AND OTHERS*

                                               Mean r With Other
Test Form                                      Nine Tests
 1. [test name illegible]                           .66
 2. [test name illegible]                           .63
 3. [test name illegible]                           .54
 4. [test name illegible]                           .61
 5. [test name illegible]                           .67
 6. [test name illegible]                           .60
 7. [test name illegible]                           .67
 8. [test name illegible]                           .60
 9. Van Wagenen History Reasoning A                 .48
10. Van Wagenen History Reasoning B                 .48
Mean of means                                       .59
Median of means                                     .60

*Data adapted from Henry L. Smith and Wendell W. Wright, Tests and Measurements (New York:
Silver, Burdett and Company, 1928), p. 242.


The median of these means is .60 and the mean is .59. These fig- 
ures are very similar in magnitude to corresponding figures already 
presented. 

Concluding statement. Data available warrant the following con- 
clusion: When two standardized achievement tests constructed and 
standardized by experts for the purpose of measuring the same 
abilities are given to pupils and the scores correlated, the median 
correlation tends to be approximately .65. This means that about 
42 per cent of the factors in the testing situation are identical; or
stated in other terms, that about 58 per cent of the factors which pro- 
duce the scores are different. 

⁴ G. M. Ruch, M. H. De Graff, W. E. Gordon... (and others), Objective
Examination Methods in the Social Studies (New York: Scott, Foresman and
Company, 1926).


CHAPTER IX 


ANALYSIS OF STANDARDIZED TEST FINDINGS IN 
TERMS OF GRADE-EQUIVALENT DISPARITY 


A coefficient of correlation does not reveal the extent and nature 
of individual variations between the scores on two tests. Therefore, 
in order to show more clearly the disparity between the standardized 
objective tests used in this study, the findings are analyzed in this 
chapter in terms of grade-equivalent scores—commonly expressed 
in years and months. 

A brief illustration will serve to clarify the basic data on which 
the tables presented in this chapter are based. The following are 
the grade-equivalent scores of three pupils on the three standardized 
tests in reading (paragraph meaning). 


Pupil     Test a     Test b     Test c
1          7.2        9.0        8.7
2          7.4        8.2        7.0
3          6.9        6.0        7.6


The disparity between Tests a and b was calculated in the following 
manner. Test b was considered as a base¹ and the disparity be-
tween the two tests was the number of months that the Test a scores
varied from the Test b scores. Whether the variation was above or 
below the base score was not taken into consideration. For pupil 
1 (in the illustration) the variation (a from b) was eighteen months 
(9.0-7.2). The tables which follow contain distributions of these 
individual variations for all subject tests given. 
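
The calculation just described can be sketched for the three illustrative pupils. The function name below is illustrative. The sketch assumes, as the text states, that one-tenth of a grade equivalent counts as one month and that the direction of the difference is ignored.

```python
# Grade-equivalent scores of the three illustrative pupils: (Test a, Test b, Test c).
scores = {
    1: (7.2, 9.0, 8.7),
    2: (7.4, 8.2, 7.0),
    3: (6.9, 6.0, 7.6),
}


def disparity_months(x, y):
    """Absolute difference between two grade equivalents, in months.

    One month is one-tenth of a school year; rounding removes the small
    floating-point error left by the subtraction.
    """
    return round(abs(x - y) * 10)


# Disparity of Test a from Test b for each pupil.
a_from_b = {pupil: disparity_months(a, b) for pupil, (a, b, c) in scores.items()}
print(a_from_b)
```

Because the absolute difference is used, exchanging the base test leaves every disparity unchanged.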


GRADE-EQUIVALENT DISPARITY—ALL SUBJECTS 


Arithmetic computation. The grade-equivalent disparity for the
three arithmetic computation tests is shown in Table XLIV.


TABLE XLIV. ARITHMETIC COMPUTATION: GRADE-EQUIVALENT DISPARITY
BETWEEN TESTS a, b, AND c IN TERMS OF MONTHS

              Test a From Test b       Test a From Test c       Test b From Test c
                          Cumulative               Cumulative               Cumulative
Months of    Number of    Percentage   Number of   Percentage   Number of   Percentage
Disparity    Pupils       Distribution Pupils      Distribution Pupils      Distribution
42-44                                      1          0.22
39-41                                      2          0.65          1          0.22
36-38                                      8          2.39          2          0.65
33-35                                     11          4.78          3          1.30
30-32            1           0.22         25         10.22          6          2.61
27-29            3           0.87         30         16.74         20          6.96
24-26            3           1.52         52         28.04         17         10.66
21-23            5           2.61         57         40.43         26         16.31
18-20           20           6.96         58         53.04         45         26.09
15-17           32          13.92         62         66.52         41         35.00
12-14           42          23.05         45         76.30         58         47.61
 9-11           83          41.09         35         83.91         68         62.39
 6-8            80          58.48         25         89.35         62         75.87
 3-5           104          81.09         31         96.09         69         90.87
 0-2            87         100.00         18        100.00         42        100.00
Total          460                       460                      460
Mean           8.4                      18.5                     12.8


Table XLIV indicates that the mean amount of disparity between
Tests a and b in arithmetic computation is 8.4 months. Approxi-
mately 35 per cent of the pupils had grade-equivalent scores on Test a
which were one school grade (ten months) or more from their
grade position as determined by Test b. The mean disparity between
Tests a and c is almost two school grades (18.5 months). About 80
per cent of the pupils varied one grade or more. Note further that
almost 10 per cent of the pupils were thirty months or more apart
on the two tests. This means that these pupils had grade-equivalent
scores on two standardized tests such as the following: Test a,
5.0, Test b, 8.0; Test a, 7.0, Test b, 4.0. The full meaning of this
variation is clearer if one recalls that the lowest possible grade-
equivalent is 0 and the highest in this case is 10.0; that these pupils
were all in the sixth grade; and that both of the tests were designed
to measure arithmetic computation ability.

¹ In the case of a given comparison, the test to be used as a base was chosen
arbitrarily. The choice in no way affects the findings, for the variation is sim-
ply the distance in months from one score to another.

The mean variation of Test b from Test c was approximately 
one- and three-tenths school grades (12.8 months). Nearly 20 per 
cent of the groups showed a minimum disparity of two grades, and 
more than 50 per cent a minimum of one school grade. 

The mean of the three arithmetic computation means is 13.2
months or about one and three-tenths school grades. On the av-
erage, then, these three standardized tests showed a disagreement
of more than one school grade in respect to the arithmetic computation
achievement of the 460 pupils.
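
The arithmetic behind these figures can be retraced with a short sketch. It is not part of the original study. It takes the pupil counts of Table XLIV and assumes, as a working hypothesis only, that the disparities were grouped into class intervals three months wide (0-2, 3-5, and so on), each represented by its midpoint (1.5, 4.5, ...). Under that assumption the grouped means fall within about a tenth of a month of the means printed in the table. The names `grouped_mean`, `cumulative_percentages`, and `mean_of_means` are illustrative.

```python
# Pupil counts from Table XLIV, ordered from the smallest disparity upward.
# ASSUMPTION (not stated in the text): the class intervals are three months
# wide (0-2, 3-5, 6-8, ...), so interval k is given the midpoint 3*k + 1.5.
COUNTS = {
    "a_vs_b": [87, 104, 80, 83, 42, 32, 20, 5, 3, 3, 1],
    "a_vs_c": [18, 31, 25, 35, 45, 62, 58, 57, 52, 30, 25, 11, 8, 2, 1],
    "b_vs_c": [42, 69, 62, 68, 58, 41, 45, 26, 17, 20, 6, 3, 2, 1],
}

# Means as printed in Table XLIV (months).
REPORTED_MEANS = {"a_vs_b": 8.4, "a_vs_c": 18.5, "b_vs_c": 12.8}


def grouped_mean(counts, width=3.0):
    """Mean disparity in months, estimated from interval midpoints."""
    total = sum(counts)
    weighted = sum(n * (width * k + width / 2) for k, n in enumerate(counts))
    return weighted / total


def cumulative_percentages(counts):
    """Per cent of pupils at or above each interval, largest disparity first."""
    total = sum(counts)
    running = 0
    result = []
    for n in reversed(counts):
        running += n
        result.append(round(100 * running / total, 2))
    return result


def mean_of_means(means):
    """Simple average of the comparison means."""
    return sum(means) / len(means)


if __name__ == "__main__":
    for name, counts in COUNTS.items():
        print(name, sum(counts), round(grouped_mean(counts), 2), REPORTED_MEANS[name])
    print("mean of means:", round(mean_of_means(list(REPORTED_MEANS.values())), 1))
```

The cumulative column is accumulated from the largest disparity downward, which is why the last entry of each comparison is 100 per cent.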

Other subject tests. The data which pertain to grade-equivalent 
disparity or variation for standardized tests in arithmetic reasoning, 
geography, health, history, language usage, literature, reading, word 
meaning, and spelling are presented in Tables XLV* to LIII*, in-
clusive. The facts in these tables are similar in kind to those given
and analyzed for arithmetic computation. The mean amount, the 
range, and the distribution of the disparity found in the case of par- 
ticular subjects may be read from the respective tables. 

Summary of grade-equivalent data. A summary of the mean 
grade-equivalent disparity for all subject tests is given in Table LIV. 


TABLE LIV. SUMMARY OF MEAN GRADE-EQUIVALENT DISPARITY FOR
ALL SUBJECT TESTS

                                    Mean of Means:
Subject                             Disparity in Months
Arithmetic computation              13.2
Arithmetic reasoning                [illegible]
Geography                           11.0
Health                              12.[?]
History                             [illegible]
Language usage                      12.[?]
Literature                          10.7
Reading (paragraph meaning)         10.0
Spelling                            [illegible]
Word meaning                        [illegible]


With the exception of three subjects, the means in Table LIV 
are very consistent. The spelling tests clearly showed less varia- 
tion than any of the other tests given. The means for arithmetic 
reasoning and word meaning are significantly lower than the mean 
of the means. It is of interest, however, to note that even in the 
case of spelling the mean disparity is somewhat more than one-half 
of a school grade. The mean disparity for seven of the ten subject 
fields ranges from one school grade (10 months) to one and three- 
tenths school grades (13.2 months). 

The mean of the means, including spelling, is 10.2 months or 
approximately one school grade. This fact signifies that if the 460 
children were classified by one of the standardized tests and then 
by a second or a third supposedly comparable test, on the average 
the individual children would be classified on the second or third 
test one school grade or ten months from their positions on the first 
test. Thus, if a pupil made a grade-equivalent score of 6.4 on the 
New Stanford arithmetic computation test the findings of this study 
indicate that his score on either of the other two arithmetic computa- 
tion tests would, in general, be either 5.4 or 7.4. 
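
The conversion used throughout this chapter, one school grade equal to ten months, can be written out as a minimal sketch. The function names are hypothetical, and the score pairs are the illustrative cases mentioned in the text, not pupil data from the study.

```python
def to_months(grade_equivalent):
    """Convert a grade-equivalent score such as 6.4 into months (64).

    Follows the convention of the text: one school grade = ten months.
    """
    return round(grade_equivalent * 10)


def disparity_in_months(ge_first, ge_second):
    """Absolute distance in months between two grade-equivalent scores."""
    return abs(to_months(ge_first) - to_months(ge_second))


def mean_disparity(pairs):
    """Mean disparity in months over a list of (score, score) pairs."""
    return sum(disparity_in_months(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Illustrative pairs taken from the examples in the text.
    print(disparity_in_months(6.4, 5.4))   # one school grade apart
    print(disparity_in_months(5.0, 8.0))   # three school grades apart
    print(mean_disparity([(6.4, 5.4), (6.4, 7.4)]))
```

Because the disparity is an absolute distance, it does not matter which test is taken as the base.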

Concluding statement. In terms of grade-equivalent scores the 
mean amount of disparity between two standardized objective tests
constructed by experts to measure the same ability, administered in
comparable fashion, and scored objectively was about one school 
grade (ten months). The conclusion may be drawn, therefore, that 
elements which produce variation enter into the testing situation in 
such degree as to cause an average of one school grade of disparity 
between the results from the two measuring instruments. 

The extent of the disparity found between test results may be 
clarified by an illustration from the physical sciences. For exam- 
ple, what would the amount of disparity found in the case of ob- 
jective tests mean in terms of weight? If the children averaged 
about sixty pounds and were distributed from thirty to one hun- 
dred pounds, there would be a mean difference between the weights 
of individual pupils as determined by two standard weighing instru- 
ments of approximately ten pounds. That is, if a given child 
weighed seventy pounds on one set of scales, and the disparity were 
the same as that found for tests, on the average, he would weigh
sixty or eighty pounds on the second set of scales. 




CHAPTER X 


SUMMARY AND CONCLUSION 


SUMMARY 


1. This investigation dealt with new-type or objective achieve-
ment tests which were constructed to measure the same abilities and
which were administered under comparable conditions. The pur- 
pose of the study was to determine the extent of disparity between 
the results from these tests. Both teacher-made and standardized 
objective tests were studied. 

2. Sixty-three informal or teacher-made objective tests were con-
structed by thirty-five teachers. The subjects covered were fifth,
sixth, and seventh grade geography; fifth, sixth, and seventh grade 
and high-school history; and sixth grade health. The tests were 
matched in such a manner that two tests which were constructed 
to measure pupil acquaintance with the same body of subject matter 
were considered as a “pair.” Paired tests were administered to 
sixty-eight groups of children. Forty-three groups took tests which 
covered relatively short units of identical text matter, and the re- 
maining twenty-five groups took semester examinations based upon 
practically identical text material. Disparity was measured in three 
ways. 

a. When the pupil scores on paired informal tests were correlated
the coefficients ranged from .845 to -.212. The median of the sixty-
eight coefficients was .54. The extent of relationship was slightly
less for semester tests than for tests covering a shorter unit of work. 

b. A study was made of disparity in terms of the difference be-
tween percentile rank positions on the two tests. The mean per-
centile rank disparity found was 23 percentile points. The dif-
ferences ranged from 0 to 99. About one-tenth of the pupils
achieved ranks on one test which were 50 or more percentile points 
from their ranks on a second test; approximately one-fifth of the 
pupils varied in rank 40 or more percentile points; and finally, one- 
third of the pupils showed a minimum percentile rank difference of 
25 percentile points. 

c. In terms of teacher marks the disparity between paired in-
formal objective tests was found to be approximately one mark in-
terval. The greatest amount of difference possible was three mark
intervals. Approximately 8 per cent of the pupils varied three
mark intervals (a mark of A on one test and a mark of F on a sec- 
ond); 23 per cent varied two mark intervals; and 36 per cent va- 
ried one mark interval. Two-thirds of the pupils received marks 
on paired tests one or more mark intervals apart. 

3. Three standardized objective tests in each of seven subjects
and two such tests in each of three subjects were administered to
460 pupils in the sixth grade. The disparity between the results
from comparable tests was found (a) in terms of correlation and
(b) in terms of difference in grade-equivalent scores.

a. The correlation between results from standardized tests con-
structed to measure the same abilities ranged from .906 to .496. The
median of the twenty-four coefficients was .68. As a group the 
spelling tests showed distinctly higher relationship than did other 
subject tests. 

b. In terms of grade-equivalent scores the mean amount of dis-
parity between two standardized tests constructed by experts to
measure the same abilities, administered in comparable fashion and 
scored objectively was found to be about one school grade (ten 
months). The range of disparity was from zero to somewhat more 
than six school grades (sixty-two months). From about 1 to 44 
per cent (depending upon the subject) of the pupils varied two or 
more school grades. 
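
Two of the measures of disparity named in this summary, the correlation between paired tests and the mean difference in percentile rank, can be sketched as follows. The sketch is not the author's procedure. The scores are hypothetical, and the percentile-rank convention used here (per cent of the group scoring below a pupil, plus half of those tied with him) is an assumption, since the summary does not state which convention was followed.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two lists of paired scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def percentile_ranks(scores):
    """Percentile rank of each pupil within the group.

    ASSUMED convention: per cent scoring below, plus half of those tied.
    """
    n = len(scores)
    ranks = []
    for s in scores:
        below = sum(1 for t in scores if t < s)
        tied = sum(1 for t in scores if t == s)
        ranks.append(100 * (below + 0.5 * tied) / n)
    return ranks


def mean_percentile_disparity(first, second):
    """Mean absolute difference in percentile rank between paired tests."""
    ranks_first = percentile_ranks(first)
    ranks_second = percentile_ranks(second)
    diffs = [abs(a - b) for a, b in zip(ranks_first, ranks_second)]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    # Hypothetical scores of eight pupils on two paired tests.
    test_one = [12, 15, 9, 20, 17, 11, 14, 18]
    test_two = [14, 10, 11, 19, 13, 16, 12, 18]
    print(round(pearson_r(test_one, test_two), 3))
    print(round(mean_percentile_disparity(test_one, test_two), 1))
```

A pair of tests that ranked every pupil identically would give a correlation of 1 and a percentile disparity of 0; the findings summarized above lie well short of that.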

CONCLUSION 


The findings of this investigation permit certain tentative gen- 
eralizations which bear upon the characteristics of objective or new- 
type tests as these tests are customarily used. The validity of these 
generalizations is dependent upon the degree to which the findings 
revealed by this study are representative. The procedures used in 
the present investigation and the consistency of the findings are sub- 
mitted as evidence that the results of this study are reasonably re- 
liable. The generalizations follow. 

1. A test may be objective in the sense that all personal opinion
is eliminated in scoring and still fail to remove important personal 
elements from the evaluation of pupil achievement. There are 
many factors (other than scoring) in the total measurement situa- 
tion which cause marked disparity between the results from two or 
more new-type or objective tests constructed to measure the same 
functions.

2. Measures of pupil achievement obtained from different infor-
mal objective tests may be expected to vary to a considerable extent.
Thus if a pupil takes Teacher A’s test his score, rank, and mark may
be very different from what his score, rank, and mark would have 
been had he taken Teacher B’s test. This condition is to be ex-
pected even when the tests cover identical bodies of subject matter
and are designed to measure the same achievement. The extent of dis-
parity which, in general, may be expected has been expressed in the
preceding summary as points a, b, and c under 2.

3. Pupil ratings based upon standardized test scores show marked
disparity. Thus a grade-equivalent rating for a given child in a
particular subject as determined by one standardized test may dif-
fer significantly from his grade-equivalent rating as determined by
a second standardized test. Although the differences will vary from
subject to subject, in general the disparity or difference may be ex-
pected to be approximately the amount reported in the preceding
summary (points a and b under 3).




PART III. PRACTICAL IMPLICATIONS AND
THEORETICAL PROBLEMS


CHAPTER XI


EDUCATIONAL IMPLICATIONS AND PROBLEMS 
FOR FURTHER RESEARCH 


EDUCATIONAL IMPLICATIONS 


Objective tests (teacher-made and standardized) are widely used 
in the public schools. There has not been much evidence adduced 
to establish the extent to which such tests are reliable measuring in- 
struments when the processes involved in the whole measurement sit- 
uation are considered. In the absence of such evidence the tests 
have been uncritically accepted, and this practice has tended to fos- 
ter error in the interpretation of test results. An acquaintance with 
the limitations of objective tests should enable the tester to make al- 
lowances and to increase the comprehensiveness of his measurements. 

An example of the questionable manner in which standardized 
objective tests are frequently used will indicate the value of recog- 
nizing the limitations of such tests. The seventh grade pupils in the 
public schools of North Carolina for several years were given the 
New Stanford Achievement Test. The suggestion¹ was made that
the grade equivalent 7.0 be taken as the minimum for promotion to 
the next grade. Now suppose that the State Department of Educa- 
tion had chosen the Public School Achievement Test instead of the 
test actually adopted. The facts revealed by this study indicate that 
in this case there would have been an average individual change in 
grade equivalence, for any given subject, of one school grade (ten 
months). For example, on the average, pupils who made a grade- 
equivalent score of 7.4 on a sub-test of the Stanford Achievement Test 
would have achieved a grade equivalent of 6.4 or 8.4 on the corre- 
sponding Public School Achievement Test. (It should be remembered 
in this connection that both of these tests are widely used, that both 
were devised and standardized by experts, that both were constructed
to measure the same abilities, and that grade-equivalent standards
were established in the same manner.) In problems of promotion and
of grade placement, the extent and consequences of the disparity
here described are obvious. It is important then that testers have
the knowledge that an individual grade-equivalent score is affected
in considerable degree by subjective factors entering into the con-
struction and use of the particular test on which the score is based.

¹ The grade equivalent suggested was tentative and was derived from local
norms (state test scores), but the point here made would be the same re-
gardless of the specific grade-equivalent score chosen for graduation. The
issue is: How dependable is a given grade equivalent as determined by a
particular test?

The implications from the study of teacher-made tests are similar 
in nature. The findings indicate that a pupil’s performance on new- 
type tests, in spite of the fact that the tests are objective in respect 
to scoring, is relatively variable. It is of significant practical value 
for persons using the new-type test to know this fact. If scores 
from tests are thought to be free from the effect of personal judg-
ment, decisions based on them may not be checked by other evi-
dence (as would tend to be the case when the decisions rest
upon personal opinion).

The facts here revealed, also, have certain implications for the 
theory or science of education. Since the appearance of Thorndike’s 
An Introduction to the Theory of Mental and Social Measurements 
in 1904 educationists have tended to maintain that exact or abso- 
lute measurements are a fundamental requisite to a science of edu- 
cation. It is of value to know the relative extent to which so-called 
objective tests satisfy this requisite. The assumption that a science 
is possible only when uniform, relatively unvarying measures are 
available may or may not be sound. If it is sound, then the facts 
here reported indicate that education can hardly base its claim 
to be a science upon the reliability of the commonly used objective 
test.

Are new-type or objective tests more or less objective measur- 
ing instruments than are essay tests? The consensus of opinion on 
this point has been (if one may judge from the literature) that the 
new-type test is much more nearly free from the effects of personal 
factors. This opinion seems to have resulted from the fact that 
many writers have neglected to consider the complicated nature of 
the testing situation as a whole. Such evidence as is available indi-
cates that there is probably little difference between the two types of
test in respect to the presence of elements which cause disparity in 
the test results. However, this point was incidental to the present 
study; an adequate solution of this problem must await the accumula-
tion of further evidence.

Finally, the educator may ask, In the light of the present evi-
dence, should new-type tests be used in the measurement of educa-
tional progress? An answer to this query can be made only when
the purpose of the tester is known. Problems such as (a) the type 
of function measured by the new-type test, (b) the extent to which 
the use of such tests promotes reflective thought or rote learning, 
(c) the degree to which new-type tests are an index to various 
types of mental content, and other similar problems demand for their 
solution data in addition to those presented in this investigation. 


PROBLEMS FOR FURTHER RESEARCH 


As was stated in Chapter I (pp. 13-15), the purpose of this study 
was to discover factual evidence bearing upon definitely limited 
aspects of the large general problem of educational measurement. 
Therefore, numerous important and closely related problems of 
measurement are outside the scope of this investigation. Their so- 
lution requires specific researches designed to discover pertinent 
evidence. In order to facilitate such research, three major prob- 
lems closely related to the present study are presented and discussed 
briefly. 

1. Causes of disparity. What are the sources or causes of the 
disparity found to exist in the construction and use of the new- 
type or objective test? The subjectivity in the testing situation when 
new-type tests are used probably results from one or more of the 
following twelve causes,² each of which involves the personal judg-
ment of the tester.

a. The items chosen. Teachers (or other test constructors) may 
and do choose different parts of given subject matter as suitable for 
test items. This difference in choice may be due to chance, to dif- 
ference in judgment as to the importance of particular subject mat- 
ter, to ease with which test items may be made, and perhaps to other 
factors. Further, it is doubtful whether the same teacher would 
select the same material for test items on two different occasions. 

This problem of selecting test items involves the intricate theory 
of sampling. One may contend that the particular items chosen to 
make up a test do not significantly affect the nature of the test, pro- 
vided the items are an adequate sample of the function tested. This 
contention seems to be based upon the assumption that one mental 
reaction is as likely to appear as another (quite as if one were sam- 
pling the apples in a barrel), and hence tends to ignore the possibility 
of mental organization. In the case of a particular individual, a spe-
cific item may recall a group of emotionally toned experiences which
tend to block cognitive functioning and thus affect his performance
on all other items in the test. In such a case, the fact that this par-
ticular item was chosen instead of numerous others that might have
been chosen would have a significant effect upon the pupil’s achieve-
ment rating.

² The probable sources of disparity listed here are regarded as more or less
testable hypotheses.

b. Manner in which test items are stated. This factor involves 
the language in which the item is couched, the length of the item, 
and related aspects. For example, two true-false items may be con- 
structed to test pupil acquaintance with a given bit of information, 
but a slight difference in language might make the difference be- 
tween a correct and an incorrect response in the case of an individ- 
ual child. In such a case, the personal decision of the tester to use 
a particular phrasing would be the cause of a change in the score 
of the pupil involved. 

The problem may be more technically stated in the following
manner. A test item is in essence (a) a series of symbols (b) or-
ganized into a unit (called a sentence) for the purpose of repre-
senting an idea. Two external factors determine the idea which
a given unit conveys: First, the particular symbols used, and sec- 
ond, the organization of the symbols. Both of these factors are sub- 
ject to wide variation from tester to tester. It follows, therefore, 
that when a given tester selects a specific set of symbols and or- 
ganizes them in a specific manner in order to produce a test item, 
in both choice and organization of symbols he has used his per- 
sonal judgment to a large degree. It follows, also, that the reaction 
of the testee to that item is conditioned in some degree by these 
subjective choices on the part of the tester. 

c. Type of test item. Ruch³ lists seventeen types of objective
test item. There are several varieties of many of these types. When 
a test is constructed the decision must be made as to what type, or 
types, of item are to be used. Two test items intended to measure 
exactly the same information may produce different responses if 
the items are different in type. It seems highly probable that a mul- 
tiple-choice test item measures a different function from that meas- 
ured by a completion item. For example, suppose two teachers in- 
dependently decide that a child in fifth grade geography should know 
what ocean a little Spanish boy would cross if he sailed directly to 
New York from his home in Spain. In order to determine whether 
or not the pupils have this information, each teacher constructs an
objective or new-type test item. The first teacher puts her item in
the following form: A little boy coming directly to New York from
his home in Spain would cross the (Pacific, Atlantic, Indian, Arctic)
Ocean. (Correct response to be underlined.) The second teacher
with exactly the same purpose in mind states her item in the fol-
lowing manner: A little boy coming directly to New York from his
home in Spain would cross the ............... Ocean. (Correct response
to be written in the blank space.) Are the two items really alike?
Or are they different? Certainly the two items may (perhaps do)
occasion different mental processes.⁴ There is reason to believe that
the difference in mental processes is sufficient to cause in many cases
variation in pupil response.

³ Ruch, op. cit., p. 189.

d. Variety and proportion of types of item. A test may consist 
entirely of one type of item, or it may be composed of a number of 
types. There may be few or many items of any given type. A test 
made up of twenty multiple-choice and twenty true-false items may 
be quite different from a test which consists of five each of true- 
false, multiple-choice, completion, and matching items. The transi- 
tion from one type of mental activity to another may affect the per- 
formance on particular items or on the test as a whole. In con- 
structing a test the tester must inevitably exercise judgment as to 
the variety and proportion of items he will use. Differences in judg- 
ment at this point make for variation in responses. 

e. Grouping and other relational aspects of items. In a given 
test the items which are based upon closely related aspects of knowl- 
edge may be grouped together, or the items may appear in the test 
without regard to their content. Whether easy or difficult items are 
placed toward the beginning or later in the test is a simple illustra- 
tion of the problem here in question. If in the case of a given child 
the first five items on a test are very difficult, affective disturbance 
may influence the pupil’s performance on the remaining part of the 
test. On the other hand, the pupil who answers the first five items 
with ease and confidence may approach the remainder of the test 
with an attitude such as will promote effective work. Thus, the 
mere decision to place particular items in a given position may have 
an effect upon pupil performance on the test. 

The organization of the test items may, also, be of much im-
portance. A test which requires a series of relatively isolated re-
sponses may yield different results than a test composed of the same
items organized into meaningful groups and arranged in systematic
sequence. For example, ten consecutive items testing the pupil’s
knowledge of the industrial products of Germany may produce dif-
ferent results than the same ten items would if scattered throughout
fifty other items. In the first case, the pupil’s mental activity may
tend to be organized or it may tend to be confused because of the
proximity of many items on the same subject. Be that as it may,
the results from a test may depend in some degree upon the tester’s
decision as to the organization of the items in his test.

⁴ Weidemann has called attention to this point. Charles C. Weidemann,
How to Construct the True-False Examination (Contributions to Education,
No. 225; New York: Teachers College, Columbia University, 1926).

f. Clarity and fullness of general and specific directions. Full 
explanation such as would tend to guarantee an understanding of the 
reaction desired may be given, or the directions may be brief to the 
extent of vagueness. Illustrations may or may not be given. The 
language in which directions are cast may vary in complexity and in 
clarity of expression. The pupils may, or may not, be permitted to 
ask questions before the test or during the test. It is possible that 
a single question asked by a student and answered by the teacher 
will affect the pupil’s performance on an item or on the whole test. 

g. Personality of the tester. The variation possible in the per- 
sonality of the persons who administer a test is almost unlimited. 
For example, the tester may be very aloof and strict in manner, or 
he may be friendly and relatively lax in discipline. The first con- 
dition may produce a negativistic feeling on the part of some pu- 
pils, thus causing a poor quality of work; the same condition may 
cause other more phlegmatic pupils to do a type of work better than 
otherwise would have been the case. On the other hand, the sec- 
ond condition may promote or hinder efficiency in particular cases. 
Further, the same tester may manifest varying attitudes on different 
occasions, depending upon his general state of health, his experience 
immediately preceding, and the like. 

h. Time allowed for taking test. Twenty items to be answered 
in twenty minutes may constitute quite a different test from a test 
made up of the same items to be answered in forty minutes. The 
general social atmosphere in which the test is taken may be rushed 
and tense, or the condition may be such as to promote the feeling of 
freedom and ease. Also, a given pupil may be able to respond ef- 
fectively if he is not under tension and is given time to think, whereas, 
under opposite conditions, his performance may be poor. In the 
case of another pupil, the reverse may be true; that is, he may do his 
best work under the tension of strict requirements. Thus, the sub-
jective decision as to the time limits of a test may affect signifi-
cantly the resulting pupil scores.

i. Number of items. The variation in respect to the length of
the test is theoretically unlimited. A test composed of ten items 
constructed to cover ten pages of subject matter may not be com- 
parable to a test made up of 150 similar items constructed for the 
same purpose. Once the teacher has decided to give a test, she must 
decide how many items are necessary for an adequate test. There is 
little reason to believe that two teachers will in a given case agree 
as to the number of items necessary for a good test. 

j. Type of mental activity required by items. A test may be com- 
posed of items which require specific knowledge reactions, or the 
test may be composed principally of items which demand reflective 
thought. Further, a test may be made up of a combination of these 
two types of mental activity, as well as of many others. A test 
composed of twenty items, each of which requires the pupil to weigh 
a body of evidence and come to a conclusion, is very different from
a test of twenty specific factual items based on the same subject 
matter. A given pupil, thus, may be able to do excellent work on 
the first test and very poor work on the second, or the reverse may 
be true. However, both tests may be considered as adequate meas- 
ures of achievement in a given subject by the respective author of 
each test. 

k. The evaluation of pupil responses. There is the problem of 
determining when the response of a child represents the mental 
content contemplated in the test item. The following item appeared 
in a history test: “Instead of states as in the United States the polit- 
ical divisions in Canada are (provinces).” The following are three 
responses made by as many children: (1) provinces, (2) provens, 
(3) profens. Is the mental content the same in each case? Does re- 
sponse (1) represent a better quality of mental reaction than does 
(2) or (3)? 

One should note that the problem raised here is not restricted to 
completion exercises. On the contrary, the problem of truly eval- 
uating a pupil response arises for every type of item. For example, 
when a pupil responds to a true-false item, only the product of his 
reaction is manifested; the mental activity which led to the response 
is completely hidden. 

l. Pupil interpretation of items. The same item may be variously
interpreted by different children or by the same child at different 
times. The child’s previous experience, his physical and mental
state, his conative state and tone, and many other significant factors
may affect his interpretation of a given item.

Since a pupil’s performance on a given test varies with his in-
terpretation of the items which make up the test, and since this
interpretation is dependent upon a large number of intricate psycho- 
logical factors, it follows that variation in pupil interpretation of 
test items may be an important cause of variation in test results. 
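
One strand of the foregoing list, the choice of items (cause a) together with the number of items (cause i), lends itself to a toy illustration. The sketch below is not drawn from the study and its figures are hypothetical. It deliberately adopts the "apples in a barrel" view of sampling that the text calls into question: a pupil is assumed to know a fixed share of a body of 200 facts, and two teachers each draw their items at random from that body. Even under this simplest assumption the two tests disagree, and they disagree more when the tests are short. The effects of wording, item type, arrangement, directions, and the other causes listed are not represented at all.

```python
import random


def simulate_mean_disparity(n_items, domain_size=200, known_share=0.6,
                            n_trials=2000, seed=0):
    """Mean gap, in percentage points, between a pupil's scores on two
    tests whose items are drawn at random from the same subject matter.

    All parameters are hypothetical; the model treats every fact as
    independent of every other (the assumption questioned in the text).
    """
    rng = random.Random(seed)
    domain = list(range(domain_size))
    # Facts the pupil is assumed to know.
    known = set(rng.sample(domain, round(known_share * domain_size)))

    total_gap = 0.0
    for _ in range(n_trials):
        test_a = rng.sample(domain, n_items)   # Teacher A's choice of items
        test_b = rng.sample(domain, n_items)   # Teacher B's choice of items
        score_a = 100 * sum(item in known for item in test_a) / n_items
        score_b = 100 * sum(item in known for item in test_b) / n_items
        total_gap += abs(score_a - score_b)
    return total_gap / n_trials


if __name__ == "__main__":
    for length in (10, 40, 150):
        print(length, "items:", round(simulate_mean_disparity(length), 1))
```

The test lengths of 10 and 150 items echo the contrast drawn under cause i; the intermediate length is added only for comparison.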

2. Possibility of absolute measurements. Is it possible to develop 
measures in the social sciences which are both adequate and free 
from subjective elements? There are several positions which may 
be taken with relation to this question. 

First, one may contend that the testing situation is so complex
as to frustrate the attempt to secure measures that take into account 
a sufficient number of the significant variables involved. One of 
the most difficult problems connected with psychological measure- 
ment grows out of the facts (a) that the reaction of that which is 
measured is basic to any measurement whatsoever, and (b) that the 
nature of this reaction depends upon a number of intricate factors. 
That is to say, in psychological measurement the instrument is an 
instrument of measurement only when it is reacted to by the thing 
being measured. The accurateness of the measurement is largely 
dependent upon the activity of the measured object. And further, 
the type and quality of activity of a given mind at a given moment 
depends upon a sensitive mental organization built up during the en- 
tire history of the mind concerned. This marked variability of the 
measured object causes variation in the results of measurement, al- 
though, objectively speaking, the instrument and the method of ap- 
plication remain constant. 

An illustration may clarify this point. If one measures the 
height of a child, except for the fact that the child’s body occupies 
space and is linear in nature, the accuracy of the measurement de- 
pends largely upon the nature of the instrument and the care with 
which it is applied. If the instrument (a steel tape) has been con- 
structed according to certain generally accepted standards and if 
the tape is applied so as to avoid error, the reaction of the object 
measured is a relatively unimportant aspect of the measurement sit- 
uation. However, the situation differs if one attempts to measure a 
child’s knowledge of a process in arithmetic. Suppose ten examples 
are used as the instrument of measurement. In this case the reac- 
tion of the child is the basic aspect of the whole measurement situa- 
tion. The child’s performance is considered as an index to the 
ability which presumably the performance represents. If for any 
one of numerous possible reasons the child does not respond “nor- 
mally,” significant error in the measurement results. 

Further, a combination of specific reactions to specific situations 
may not be a meaningful index to the achievement of the individual 
as a whole. If (a) exact, quantitative measures are possible only 
when a measurable unitary phenomenon has been abstracted and iso- 
lated, (b) a person is essentially a whole rather than a sum of the 
“parts” of which he is constituted, and (c) the purpose of meas- 
urement is to ascertain the efficiency with which the whole organism 
adjusts itself to its environment, it follows then that when one 
measures the parts (as is so accurately done in the physical sciences) 
one may have done little or nothing toward securing a meaningful 
measure of the probable performance of the person as a whole. 

Second, one may share the faith expressed by Thorndike in the 
following statement: 

We have faith that whatever people now measure crudely by mere 
descriptive words, helped out by the comparative and superlative forms, 
can be measured more precisely and conveniently if ingenuity and labor 
are set at the task. We have faith also that the objective products pro- 
duced, rather than the inner condition of the person whence they spring, 
are the proper point of attack for the measurer, at least in our day and 
generation. 

This is obviously the same general creed as that of the physicist or 
chemist or physiologist engaged in quantitative thinking—the same, in- 
deed, as that of modern science in general. And, in general, the nature 
of educational measurements is the same as that of all scientific measure- 
ments.⁵ 

⁵ Thorndike, Seventeenth Yearbook of the National Society for the Study of 
Education, pp. 16-17. 

Or finally one may take the closely related position that measure- 
ments of psychological entities comparable to measurements in the 
physical sciences are possible, but that such instruments must be de- 
veloped very gradually as insights into the relationships requisite 
to the production of a refined instrument of measurement are gained. 
An extended quotation from the German psychologist and physicist, 
Wolfgang Kohler, may illuminate this point: 

The problems which Galileo attacked in the seventeenth century could 
be solved quantitatively at once, the qualitative experience of everyday 
life having sufficiently provided the necessary basis. But for the ma- 
jority of psychological problems this is not the case. Where do we have 
that first more or less qualitative knowledge of important functional re- 
lationships in psychology which might become the basis for indirect and 
exact measurement? It does not exist. Since the development of more 
“exact” methods presupposes its existence, our main task must be the 
gathering of that knowledge. In most cases our preliminary advance in 
this direction will have to be crude and qualitative. Whoever protests 
that conclusion in the name of purism does not understand our actual 
situation in psychology; he sees neither the nature of, nor the historical 
background prerequisite for, special quantitative methods. If we wish 
to imitate the physical sciences, we must not imitate them in their con- 
temporary, most developed form; we must imitate them in their historical 
youth, when their state of development was comparable to our own at 
the present time. Otherwise we should behave like boys who try to copy 
the imposing manners of full-grown men without understanding their 
raison d’étre, also without seeing that in development one cannot jump 
over intermediate and preliminary phases. A survey of the history of 
physics is certainly illuminating. Let us imitate the natural sciences, but 
intelligently ! 

Behavior is enormously rich in forms and nuances. Only acknowl- 
edging this wealth, and studying it directly as it is given in all its 
fascinating varieties, shall we become able gradually to find those forms 
of more quantitative, and perhaps more accurate, procedure which may 
become as adequate for us as are the methods of physics in its realm. 
At present, and in this broader historical perspective, qualitative observa- 
tion and analysis may be, in a sense, more exact, i. e., adequate to our sub- 
ject-matter, than much blind measurement. We shall press forward to- 
wards more refined methods, of course; but owing to our situation as be- 
ginners, we can go forward only through the use of less refined methods 
for the time being.⁶ 


⁶ Kohler, op. cit., pp. 43-44. See also J. F. Brown, “A Methodological Con- 
sideration of the Problem of Psychometrics,” Erkenntnis, iv Band (1934) 
Heft 1, pp. 46-61. 

3. Desirability of absolute measurements. If the assumption that 
absolute educational measurements are possible is granted for the 
moment, another problem⁷ arises: Should educational measurements 
be objective in the sense in which the term is used in the physical 
sciences?⁸ On the one hand, it is possible that intelligent personal 
judgment, based upon specific and locally determined educational 
aims, is in actual school procedures the most valuable type of meas- 
urement. If the statement just made is sound, the performance re- 
quired and the meaning attributed to a performance would vary from 
community to community and probably from child to child. That 
is, achievement would be considered as a relative concept, its inter- 
pretation in any given case depending upon the aims of the educator 
with respect to the factors of the specific situation. In essence this 
position rests upon the belief that the first and most basic problem 
of educational measurement is the ascertaining of educational aims; 
and that since tests are constructed to determine the extent to which 
the aims or objectives have been realized, the nature of the tests must 
grow out of the nature of the objectives.⁹ Testing, according to 
this view, is essentially a process of gathering evidence upon which 
a decision as to the presence of the learning product may be based.¹⁰ 

⁷ The problem relating to the desirability of absolute measurements seems to 
be essentially a problem for research in the philosophy of education. The re- 
sults of such studies should contribute much to the development of effective 
educational measurements. 

⁸ See quotations from Thorndike given on p. 21 for elucidation of the mean- 
ing of this type of measurement. 

On the other hand, progress in educational procedures may de- 
pend chiefly upon the development of exact and uniform measuring 
instruments which are capable of yielding accurate and unambiguous 
measures of educational achievement.¹¹ Perhaps the goal of test 
construction should be the production of universally applicable tests 
and scales in all subjects to the end that a unit of achievement in 
arithmetic at one developmental level (or for a given child), for ex- 
ample, may be comparable to a unit of achievement in that subject 
at any other level (or for any other child). 

Finally, the positions stated in the two preceding paragraphs may 
be subject to effective synthesis. Eventually, uniformity may be 
attained in some educational objectives in such degree as to permit 
uniform standards. In these cases exact (in a limited sense) and 
nonpersonal measuring procedures may have a contribution to 
make to those aspects of educational measurements for which the 
most accurate (in terms of specific aims) measure involves a large 
portion of the measurer’s personal judgment. For although this 
personal judgment must function within a frame of reference which 
is personal, it may be desirable that the judgments made within this 
frame of reference be as nonpersonal or objective as possible. 

⁹ The significance of objectives for adequate measurement has been very 
ably outlined and defended by R. W. Tyler. For extended and illuminating 
discussion see articles by Tyler appearing in Educational Research Bulletin 
(Columbus: Bureau of Educational Research, Ohio State University), Vols. XI 
to XIV, inclusive. 

¹⁰ H. C. Morrison, The Practice of Teaching in the Secondary School 
(Chicago: The University of Chicago Press, 1926), chap. v. 

¹¹ For a critical discussion of this point see W. A. Brownell, “The Use of 
Objective Measures in Evaluating Instruction,” Educational Method, XIII 
(May-June, 1934), 401-408. 


APPENDIX 


TABLE II. NUMBER AND TYPE OF ITEMS IN EACH OF SIXTY-THREE 
TEACHER-MADE OBJECTIVE TESTS 

NUMBER OF VARIOUS TYPES OF ITEM 

Test    Testing Unit    Completion    Matching 

IGS ax Identical 20 
TGS ese Z 10 
IlIGS Hs 
IVGS. s 
VGSr ans if 20 
VIIGS. os 20 
VILIGS 7 10 
1XG5S. se 15 
XIIGS as 20 
XILIGS oO 
IG6r cee se 40 
AIGG6 ee. fe 15 
IVG6..... ss 
VIG6ES ee oe 9 
VIIIG6 “s 5 
IXG6. se 
RGGnecrrs ss 5 
XTG6er) me 10 
XIIIG6 .. e 7 we 
IGiieieces Semester 5 15 
MIG 7eer ss 50 
I11G7 se 
IVG7> ene. fs 25 
VG7 inane cs 10 
MEDS eevee Identical 
IIHS. sf 
IVHS. ac 5 10 
WES eines: ss 20 
VITHS. st 10 
VIITHS. ef 12 
IXHS. ss 
XITHS uy 25 
THG ere ES 10 
ITHG are e 25 
IVH6. Ss 10 <= 
IVH6A s 20 
VIH6.. <f 9 5 
VIIIH6. qi ll 
XHGree ef 17 
XIH6 x 15 
XIITH6 ee 
TH 7aviees Semester 5 
TED Sees ss 5 
THficaeee ss 5 Je 
TI /asen ss 10 25 
IIH7b.... re 10 30 
IIH7c.... “ 10 30 


*I, Teacher I; G, Geography; 5, Grade 5. 


Multiple- 
Choice 


Aun: 


True— 
False 


25 
10 
15 
10 
15 


30 
22 


Location 


18 


Unclassi- 
fied 


16 


Total 
Number 
of Items 


a 



TABLE II. (Continued) 

NUMBER OF VARIOUS TYPES OF ITEM 

Test | Testing Unit | Completion | Matching | Multiple-Choice | True-False | Location | Unclassified | Total Number of Items 

IIH7d    "          10 30 10 50 
IIH7e    "          10 30 10 50 
IIIH7    "          25 .. 25 50 
IVH7     "          10 30 20 40 100 
VH7      "          20 .. 20 30 70 
VIH7     "          10 20 10 10 50 
IHSF     "          25 25 50 100 
IIHSF    "          30 20 20 10 80 
IHSB     "          20 15 30 40 105 
IIHSB    "          20 20 25 65 
IHSH     Identical  23 15 16 25 69 
IIHSH    "          15 18 20 53 
IIIHSH   "          12 15 12 20 59 
IVHSH    "          8 15 16 39 
IHe6     "          10 5 10 25 
IIHe6    "          25 25 





TABLE XLV. ARITHMETIC REASONING—GRADE-EQUIVALENT DISPARITY 
BETWEEN TESTS a, b, AND c IN TERMS OF MONTHS 

            Test a From Test b      Test a From Test c      Test b From Test c
Months of   Number of   Cumulative  Number of   Cumulative  Number of   Cumulative
Disparity      Pupils   Percentage     Pupils   Percentage     Pupils   Percentage
                      Distribution            Distribution            Distribution
36-38              ..           ..         ..           ..          1         0.22
33-35              ..           ..         ..           ..          1         0.44
30-32              ..           ..          1         0.22          2         0.87
27-29               1         0.22          5         1.31          1         1.09
24-26               2         0.65          4         2.18          3         1.74
21-23               1         0.87          8         3.92          8         3.48
18-20               8         2.61         14         6.96         18         7.39
15-17              12         5.22         26        12.61         24        12.61
12-14              26        10.87         37        20.65         48        23.05
9-11               52        22.17         61        33.91         77        39.79
6-8                89        41.52         96        54.78         94        60.22
3-5               149        73.91        117        80.22         85        78.70
0-2               120       100.00         91       100.00         98       100.00

Total             460                     460                     460
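
The cumulative percentage columns in these tables can be re-derived from the pupil counts. The following Python sketch is offered only as an illustration and is not part of the original study. It takes the "Test a From Test b" counts of Table XLV, listed from the largest disparity class downward, and accumulates them as a percentage of the 460 pupils. The 3-month class labels, the constant `COUNTS_A_FROM_B`, and the function name `cumulative_percentages` are assumptions introduced here.

```python
from itertools import accumulate

# "Test a From Test b" pupil counts from Table XLV, ordered from the
# largest disparity class to the smallest. The class labels (months of
# disparity) are assumed to run in 3-month steps.
COUNTS_A_FROM_B = [
    ("27-29", 1), ("24-26", 2), ("21-23", 1), ("18-20", 8), ("15-17", 12),
    ("12-14", 26), ("9-11", 52), ("6-8", 89), ("3-5", 149), ("0-2", 120),
]


def cumulative_percentages(rows):
    """Return (label, count, cumulative percent) for each disparity class."""
    total = sum(count for _, count in rows)
    running = accumulate(count for _, count in rows)
    return [
        (label, count, round(100 * cum / total, 2))
        for (label, count), cum in zip(rows, running)
    ]


if __name__ == "__main__":
    for label, count, pct in cumulative_percentages(COUNTS_A_FROM_B):
        print(f"{label:>6} {count:>4} {pct:>7.2f}")
```

Read this way, the entry opposite the 3-5 class means that roughly 74 percent of the pupils showed a disparity of three months or more between tests a and b. In some other columns the printed figures differ from a direct computation by 0.01, which would be consistent with adding already-rounded percentages; that explanation is a guess, not something the source states.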






TABLE XLVI. GEOGRAPHY—GRADE-EQUIVALENT DISPARITY BETWEEN TESTS 
a, b, AND c IN TERMS OF MONTHS 

            Test a From Test b      Test a From Test c      Test b From Test c
Months of   Number of   Cumulative  Number of   Cumulative  Number of   Cumulative
Disparity      Pupils   Percentage     Pupils   Percentage     Pupils   Percentage
                      Distribution            Distribution            Distribution
42-44              ..           ..          1         0.22         ..           ..
39-41              ..           ..          4         1.09          2         0.43
36-38              ..           ..          6         2.40         ..         0.43
33-35               1         0.22         11         4.79         ..         0.43
30-32               5         1.31         17         8.49          8         2.17
27-29               5         2.40         21        13.06         15         5.43
24-26              10         4.57         26        18.71          8         7.17
21-23              15         7.83         21        23.28         18        11.09
18-20              18        11.75         33        30.45         20        15.44
15-17              29        18.05         41        39.36         30        21.96
12-14              37        26.09         56        51.53         41        30.87
9-11               68        40.87         52        62.83         60        43.92
6-8                74        56.96         64        76.74         84        62.18
3-5               109        80.65         50        87.61         91        81.96
0-2                89       100.00         57       100.00         83       100.00

Total             460                     460                     460
Mean              9.0                    14.1                    10.0








TABLE XLVII. HEALTH—GRADE-EQUIVALENT DISPARITY BETWEEN TESTS 
b AND c IN TERMS OF MONTHS 

               Test b From Test c
Months of      Number of     Cumulative
Disparity         Pupils     Percentage
                           Distribution
48-50                  1           0.22
45-47                 ..           0.22
42-44                 ..           0.22
39-41                  3           0.87
36-38                  3           1.52
33-35                  3           2.17
30-32                  6           3.48
27-29                  9           5.44
24-26                 12           8.05
21-23                 32          15.00
18-20                 40          23.69
15-17                 41          32.60
12-14                 54          44.34
9-11                  60          57.39
6-8                   68          72.17
3-5                   59          85.00
0-2                   69         100.00

Total                460






TABLE XLVIII. HISTORY—GRADE-EQUIVALENT DISPARITY BETWEEN TESTS 
a, b, AND c IN TERMS OF MONTHS 

            Test a From Test b      Test a From Test c      Test b From Test c
Months of   Number of   Cumulative  Number of   Cumulative  Number of   Cumulative
Disparity      Pupils   Percentage     Pupils   Percentage     Pupils   Percentage
                      Distribution            Distribution            Distribution
42-44              ..           ..          2         0.43         ..           ..
39-41              ..           ..         ..         0.43         ..           ..
36-38              ..           ..          1         0.65         ..           ..
33-35              ..           ..          5         1.74          4         0.87
30-32               1         0.22          2         2.17          4         1.74
27-29              ..         0.22         13         5.00         10         3.91
24-26               4         1.09         21         9.57         12         6.52
21-23               6         2.39         34        16.96         22        11.30
18-20              21         6.97         31        23.69         30        17.82
15-17              36        14.80         65        37.83         51        28.91
12-14              55        26.76         66        52.17         56        41.08
9-11               84        45.02         73        68.04         59        53.91
6-8                87        63.92         51        79.13         81        71.52
3-5                88        83.05         53        90.65         75        87.83
0-2                78       100.00         43       100.00         56       100.00

Total             460                     460                     460
Mean              8.8                    13.2                    11.3


TABLE XLIX. LANGUAGE USAGE—GRADE-EQUIVALENT DISPARITY BETWEEN 
TESTS a, b, AND c IN TERMS OF MONTHS 

            Test a From Test b      Test a From Test c      Test b From Test c
Months of   Number of   Cumulative  Number of   Cumulative  Number of   Cumulative
Disparity      Pupils   Percentage     Pupils   Percentage     Pupils   Percentage
                      Distribution            Distribution            Distribution
[illegible]        ..           ..         ..           ..          1         0.22
[illegible]        ..           ..         ..           ..          2         0.65
42-44              ..           ..         ..           ..          2         1.08
39-41              ..           ..         ..           ..          2         1.51
36-38              ..           ..         ..           ..          5         2.60
33-35               2         0.43          2         0.43          9         4.56
30-32              15         3.69          7         1.95         11         6.95
27-29              16         7.17         12         4.56         22        11.73
24-26              12         9.78         20         8.91         13        14.56
21-23              26        15.43         36        16.74         42        23.69
18-20              45        25.21         38        25.00         39        32.17
15-17              57        37.60         52        36.30         36        40.00
12-14              49        48.25         57        48.69         42        49.13
9-11               58        60.86         65        62.83         39        57.61
6-8                74        76.95         54        74.57         52        68.91
3-5                55        88.91         56        86.74         90        88.48
0-2                51       100.00         61       100.00         53       100.00

Total             460                     460                     460
Mean             12.7                    12.4                    13.7



TABLE L. LITERATURE—GRADE-EQUIVALENT DISPARITY BETWEEN TESTS a AND c 
IN TERMS OF MONTHS 

               Test a From Test c
Months of      Number of     Cumulative
Disparity         Pupils     Percentage
                           Distribution
27-29                  5           1.09
24-26                 16           4.57
21-23                 22           9.35
18-20                 38          17.61
15-17                 41          26.52
12-14                 53          38.04
9-11                  69          53.04
6-8                   76          69.56
3-5                   80          86.95
0-2                   60         100.00

Total                460
Mean                10.7











TABLE LI. READING—PARAGRAPH MEANING—GRADE-EQUIVALENT DISPARITY 
BETWEEN TESTS a, b, AND c IN TERMS OF MONTHS 

            Test a From Test b      Test a From Test c      Test b From Test c
Months of   Number of   Cumulative  Number of   Cumulative  Number of   Cumulative
Disparity      Pupils   Percentage     Pupils   Percentage     Pupils   Percentage
                      Distribution            Distribution            Distribution
45-47               1         0.22         ..           ..         ..           ..
42-44               1         0.44         ..           ..         ..           ..
39-41               1         0.66         ..           ..         ..           ..
36-38               3         1.31         ..           ..         ..           ..
33-35               1         1.53          5         1.09          3         0.65
30-32               3         2.18          1         1.31          1         0.87
27-29               6         3.49          5         2.40          2         1.30
24-26              16         6.97         13         5.23         10         3.47
21-23              10         9.14         17         8.92          5         4.56
18-20              36        16.97         38        17.18         19         8.69
15-17              45        26.75         49        27.83         38        16.95
12-14              46        36.75         42        36.96         59        29.78
9-11               71        52.18         50        47.83         60        42.83
6-8                68        66.96         77        64.56         62        56.31
3-5                92        86.96         88        83.69        120        82.39
0-2                60       100.00         75       100.00         81       100.00

Total             460                     460                     460
Mean             10.8                    10.4                     8.9



TABLE LII. SPELLING—GRADE-EQUIVALENT DISPARITY BETWEEN TESTS 
a, b, AND c IN TERMS OF MONTHS 

            Test a From Test b      Test a From Test c      Test b From Test c
Months of   Number of   Cumulative  Number of   Cumulative  Number of   Cumulative
Disparity      Pupils   Percentage     Pupils   Percentage     Pupils   Percentage
                      Distribution            Distribution            Distribution
39-41              ..           ..         ..           ..          1         0.22
36-38              ..           ..         ..           ..         ..         0.22
33-35              ..           ..         ..           ..         ..         0.22
30-32              ..           ..         ..           ..          1         0.44
27-29               2         0.43         ..           ..         ..         0.44
24-26              ..         0.43          1         0.22          1         0.66
21-23               4         1.30          2         0.65          3         1.31
18-20               5         2.39          4         1.52          3         1.96
15-17               9         4.35          8         3.26         10         4.13
12-14              33        11.52         16         6.74         17         7.83
9-11               61        24.78         22        11.52         44        17.39
6-8                99        46.30         70        26.74        114        42.17
3-5               126        73.69        157        60.87        123        68.91
0-2               121       100.00        180       100.00        143       100.00

Total             460                     460                     460
Mean              6.5                     4.8                     5.9
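
The source does not state how the Mean row of these tables was computed. As a rough check, a grouped-data mean can be formed by weighting an assumed midpoint of each 3-month class by its pupil count. The Python sketch below is an illustration only. It assumes that a class such as "3-5" covers 3 up to (but not including) 6 months, giving midpoints 1.5, 4.5, 7.5, and so on; that convention, the 3-month class sequence, and the names `SPELLING_COUNTS` and `grouped_mean` are assumptions introduced here, not the author's stated procedure.

```python
# Pupil counts from Table LII (Spelling), listed from the 0-2 month class
# upward in 3-month steps; classes with no pupils are entered as 0.
SPELLING_COUNTS = {
    "a from b": [121, 126, 99, 61, 33, 9, 5, 4, 0, 2],
    "a from c": [180, 157, 70, 22, 16, 8, 4, 2, 1],
    "b from c": [143, 123, 114, 44, 17, 10, 3, 3, 1, 0, 1, 0, 0, 1],
}

CLASS_WIDTH = 3  # months per class (assumed)


def grouped_mean(counts, width=CLASS_WIDTH):
    """Mean of grouped data, using assumed class midpoints."""
    # Assumption: class i spans [i*width, (i+1)*width), so its midpoint
    # is i*width + width/2 (1.5, 4.5, 7.5, ... months).
    midpoints = [i * width + width / 2 for i in range(len(counts))]
    total = sum(counts)
    return sum(m * n for m, n in zip(midpoints, counts)) / total


if __name__ == "__main__":
    for pair, counts in SPELLING_COUNTS.items():
        print(f"Test {pair}: {grouped_mean(counts):.2f} months")
```

Under these assumptions the three columns of Table LII come to about 6.5, 4.8, and 5.9 months when rounded to one decimal, in line with the printed Mean row. Agreement for one table does not establish the method used for the others.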


TABLE LIII. WORD MEANING—DISPARITY BETWEEN TESTS a AND c IN TERMS 
OF MONTHS 

               Test a From Test c
Months of      Number of     Cumulative
Disparity         Pupils     Percentage
                           Distribution
33-35                  1           0.22
30-32                 ..           0.22
27-29                  1           0.44
24-26                  1           0.66
21-23                  3           1.31
18-20                 20           5.65
15-17                 28          11.73
12-14                 32          18.68
9-11                  51          29.76
6-8                   75          46.10
3-5                  131          74.57
0-2                  117         100.00

Total                460


BIBLIOGRAPHY 


Note: This bibliography consists only of titles that have been referred to 
in the text. Several extensive bibliographies, both on the general subject of 
educational measurement and on the more limited subject of objective or 
new-type tests, are available. These more complete bibliographies appear as 
starred (*) references in the list which follows. 


*“Bibliography on Educational Tests and Their Use,” Review of Educa- 
tional Research, III (Feb., 1933), 62-80. 

Brown, J. F. “A Methodological Consideration of the Problems of Psycho- 
metrics,” Erkenntnis, iv Band (1934) Heft 1, pp. 46-61. 

Brownell, William A. “On the Accuracy with Which Reliability May Be 
Measured by Calculating Test Halves,” Journal of Experimental Edu- 
cation, I (March, 1933), 204-215. 

________. “The Use of Objective Tests in Evaluating Instruction,” Edu- 
cational Method, XIII (May-June, 1934), 401-408. 

Chadwick, E. B. “Statistics of Educational Results,” The Museum, A 
Quarterly Magazine of Educational Literature and Science, III (Jan., 
1864), 479-484. 

Dingle, Herbert. Science and Human Experience. New York: The Mac- 
millan Company, 1932. 141 pp. 

Foran, T. G., and Loyes, Sister M. Edmund. “The Relative Difficulty of 
Three Achievement Examinations,” The Journal of Educational Psy- 
chology, XXVI (March, 1935), 218-222. 

Garrett, Henry E. Statistics in Psychology and Education. New York: 
Longmans, Green, and Company, 1926. 317 pp. 

Johnson, F. W. “A Study of High School Grades,” School Review, XIX 
(Jan., 1911), 13-24. 

*Jones, Vernon, and Brown, Robert H. “Educational Tests,” Psycholog- 
ical Bulletin, XXXII (July, 1934), 473-499. 

Kelly, F. J. Teachers’ Marks, Their Variability and Standardization. 
Teachers College Contributions to Education, No. 66. New York: 
Teachers College, Columbia University, 1914. 139 pp. 

Kohler, Wolfgang. Gestalt Psychology. New York: Horace Liveright, 
Inc., 1929. 403 pp. 

*Lee, J. Murray, and Symonds, P. M. “New-Type or Objective Tests: A 
Summary of Recent Investigations,” The Journal of Educational Psy- 
chology, XXIV (Jan., 1933), 21-38. 

*________ “New-Type or Objective Tests: A Summary of Recent Investi- 
gations,” The Journal of Educational Psychology, XXV (March, 
1934), 185-191. 

McCall, William A. How to Measure in Education. New York: The 
Macmillan Company, 1922. 416 pp. 

Monroe, Walter Scott. An Introduction to the Theory of Educational 
Measurements. New York: Houghton Mifflin Company, 1923. 364 pp. 

Monroe, W. S., and Souders, L. B. Present Status of Written Examina- 
tions and Suggestions for Their Improvement. Bulletin No. 17. Ur- 
bana: College of Education, Bureau of Educational Research, Uni- 
versity of Illinois, 1923. 77 pp. 

Morrison, Henry C. The Practice of Teaching in the Secondary School. 
Chicago: The University of Chicago Press, 1926. 661 pp. 

Odell, C. W. Educational Measurement in High School. New York: The 
Century Company, 1930. 641 pp. 

________. A Glossary of Three Hundred Terms Used in Educational Meas- 
urement and Research. Bulletin No. 40. Urbana: College of Education, 
Bureau of Educational Research, University of Illinois, 1928. 68 pp. 

________. A Selected Annotated Bibliography Dealing with Examinations 
and School Marks. Bulletin No. 43. Urbana: College of Education, 
Bureau of Educational Research, University of Illinois, 1929. 42 pp. 

Pullias, E. V. “A Study of Current Opinion Concerning Objective Tests,” 
Educational Method, XVI (April, 1937), 348-356. 

Rice, J. M. “The Futility of the Spelling Grind,” Forum, XXIII (June, 
1897), 409-419. 

________. “The Futility of the Spelling Grind,” Forum, XXIII (April, 
1897), 163-172. 

Ritchie, A. D. Scientific Method: An Inquiry into the Character and 
Validity of Natural Laws. London: Kegan Paul, 1923. 204 pp. 

Robinson, Daniel Sommer. The Principles of Reasoning: An Introduction 
to Logic and Scientific Method. New York: D. Appleton and Company, 
1924. 393 pp. 

*Ruch, G. M. The Objective or New-Type Examination. New York: 
Scott, Foresman and Company, 1929. 478 pp. 

Ruch, G. M., De Graff, M. H., Gordon, W. E.... (and others). Ob- 
jective Examination Methods in the Social Studies. New York: Scott, 
Foresman and Company, 1926. 116 pp. 

Russell, Charles. Standard Tests. Boston: Ginn and Company, 1930. 
516 pp. 

Sandon, Frank. “The Necessary Imperfections of an Examination,” The 
British Journal of Educational Psychology, V (June, 1935), 180-192. 

*Smith, Henry Lester, and Wright, Wendell William. Second Revision of 
the Bibliography of Educational Measurements. Bulletin of the School 
of Education, Vol. IV, No. 2. Bloomington: Bureau of Co-operative 
Research, Indiana University, 1927. 251 pp. 

________. Tests and Measurements. New York: Silver, Burdett and Com- 
pany, 1928. 540 pp. 

Starch, Daniel. Educational Measurements. New York: The Macmillan 
Company, 1918. 202 pp. 

Starch, Daniel, and Elliott, E. C. “The Reliability of Grading High 
School Work in Mathematics,” School Review, XXI (April, 1913), 
254-259. 

Thorndike, Edward L. An Introduction to the Theory of Mental and So- 
cial Measurements. (Revised Edition, 1913.) New York: Teachers 
College, Columbia University, 1904. 277 pp. 

________. “The Nature, Purposes, and General Methods of Measurements 
of Educational Products,” The Measurement of Educational Products. 
Seventeenth Yearbook of the National Society for the Study of Edu- 
cation, Part II. Bloomington, Illinois: Public School Publishing Com- 
pany, 1918. Pp. 16-24. 

Tyler, R. W. “Formulating Objectives for Tests,” Educational Research 
Bulletin, Columbus, Bureau of Educational Research, Ohio State Uni- 
versity, XII (Oct. 11, 1933), 197-206. 

Weidemann, Charles C. How to Construct the True-False Examination. 
Teachers College Contributions to Education, No. 225. New York: 
Teachers College, Columbia University, 1926. 118 pp. 


XLII (March, 1935), 153-165. 














